CN105224520B - A kind of Chinese patent document term automatic identifying method - Google Patents

A kind of Chinese patent document term automatic identifying method

Info

Publication number
CN105224520B
Authority
CN
China
Prior art keywords
candidate terms
speech
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510623936.4A
Other languages
Chinese (zh)
Other versions
CN105224520A (en)
Inventor
吕学强
董志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University
Priority to CN201510623936.4A
Publication of CN105224520A
Application granted
Publication of CN105224520B
Active
Anticipated expiration

Abstract

The present invention relates to an automatic term recognition method for Chinese patent documents, comprising the following steps: Step 1): automatically generate part-of-speech rules from patent titles; Step 2): manually construct a stop-word list; Step 3): classify the generated part-of-speech rules by the number of parts of speech they contain; Step 4): rank the candidate terms with the TermRank ranking algorithm. The invention first uses a statistical method to learn automatically, from patent titles, the part-of-speech rules that form terms, which overcomes the shortcomings of manually summarized part-of-speech rules. Ranking the candidate terms with TermRank takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms better, is more reliable, and meets the needs of practical applications well.

Description

A kind of Chinese patent document term automatic identifying method
Technical field
The invention belongs to the technical field of automatic Chinese term recognition, and in particular relates to an automatic term recognition method for Chinese patent documents.
Background art
Chinese patent documents contain a large number of domain terms, and recognizing these terms automatically is an important task in fields such as information extraction and text mining. Automatic term recognition (ATR) is an important part of information extraction research. It refers to the process of automatically identifying, from free text, the word strings that represent general concepts in a given professional domain, with no manual intervention or as little as possible. A term bank built by automatic term recognition is a very important basic data resource: it provides indispensable data support for Chinese word segmentation, ontology construction, dictionary compilation and updating, automatic indexing, information retrieval, machine translation, and so on. In addition, with the rapid development of information technology, digital information resources grow daily, and recognizing the terms in these resources automatically is of great significance for keeping track of the latest developments and future trends of a field in a timely manner.
Chinese patent documents are an important digital information resource. They record the latest research results of every discipline and contain a large number of technical terms. Combining observation and analysis of Chinese patent documents with earlier research, terms in patent documents show the following distinguishing characteristics: (1) term nesting is relatively common; (2) terms are strongly domain-related, i.e. a term that appears with high frequency in one field appears with low frequency, or not at all, in other fields; (3) terms recur, i.e. a term appears in many documents of the whole patent collection; (4) patent terms are relatively long, generally consisting of 2 to 5 words; (5) patent terms mostly consist of nouns or compound nouns. These characteristics are an important basis for automatic term recognition in Chinese patent documents.
At present, there are two main kinds of automatic term recognition methods:
The first is the traditional approach that combines rules with statistics. To generate the candidate term set, the Chinese text is first segmented and part-of-speech tagged; the part-of-speech rule set that forms terms is summarized by observing the tagged corpus, and these rules are matched against the corpus to generate the candidate term set. Although manually written part-of-speech rules give relatively high recognition precision, they depend too heavily on the author's linguistic knowledge, and different people write inconsistent rules for the same corpus. Other methods do not need part-of-speech rules in the candidate acquisition stage, but they depend too heavily on external resources when coarsely segmenting sentences, so the quality of the external resources often determines the quality of the candidate term set. As for ranking the candidate term set, the ranking algorithms in common use perform poorly on terms that are short or that occur with low frequency;
The second is to use machine learning algorithms, which have become a research hotspot in information extraction in recent years. Their drawback is that they place high demands on the scale and quality of the training corpus, require a large amount of manually annotated data, and take a long time to train.
In addition, the current mainstream candidate term ranking algorithms perform poorly on shorter terms.
Summary of the invention
In view of the above problems in the prior art, the object of the invention is to provide an automatic term recognition method for Chinese patent documents that avoids the above technical defects.
To achieve this object, the invention adopts the following technical solution:
An automatic term recognition method for Chinese patent documents, comprising the following steps:
Step 1): Automatically generate part-of-speech rules from patent titles: a Chinese lexical analysis system segments each patent title into substrings and stop words; with the stop words as separators, the part-of-speech rules of the substrings are extracted and used as the part-of-speech rules for generating candidate terms;
Step 2): Manually construct a stop-word list and add the stop words to it;
Step 3): Classify the generated part-of-speech rules by the number of parts of speech they contain; sort each class of rules in descending order of frequency of occurrence, and apply only the Top 5 rules of each class to the body text of the Chinese patent documents for part-of-speech matching to generate the candidate term set; then classify the extracted candidate terms by the number of words they contain;
Step 4): Rank the candidate terms with the TermRank ranking algorithm, which is defined by formula (1):
$$TR(T_i)=\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-|T_i|\times count(T_i) \tag{1}$$
where $T_i$ is a candidate term and $TR(T_i)$ is its TermRank value; $M$ is the number of patent documents that contain $T_i$; $TF_{T_i}(d_j)$ is the frequency of $T_i$ in a patent document $d_j$ that contains it; $C(d_j)$ is the number of candidate terms extracted from $d_j$; $|T_i|$ is the length of $T_i$; and $count(T_i)$ is the number of stop words contained in $T_i$;
The TermRank value of each candidate term in the candidate term list is calculated according to formula (1); after ranking, the Top-N entries are taken as the final term list.
Further, step 2) selects stop words to build the stop-word list using the following three methods:
Method one: segment the patent titles and count word frequencies; stop words whose frequency of occurrence is higher than 20 are added to the stop-word list;
Method two: parts of speech that clearly do not occur in terms are added to the stop-word list;
Method three: after the patent titles have been filtered with the stop-word list generated by method one and method two, the remaining word strings in the titles are inspected manually; any newly found stop words are also added to the stop-word list.
Further, in step 3), the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules.
Further, in step 3), the candidate terms are divided into four classes: 2-word, 3-word, 4-word and 5-word candidate terms.
Further, in step 4), when the value of $M$ is relatively large or relatively small, the first and second terms of formula (1) are normalized with formula (2) and formula (3) respectively, where formulas (2) and (3) are:
$$\frac{\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}}{\max\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}} \tag{2}$$
$$\frac{|T_i|\times count(T_i)-\min|T_i|\times count(T_i)}{\max|T_i|\times count(T_i)-\min|T_i|\times count(T_i)} \tag{3}$$
The automatic term recognition method for Chinese patent documents provided by the invention first uses a statistical method to learn automatically, from patent titles, the part-of-speech rules that form terms, which overcomes the shortcomings of manually summarized part-of-speech rules. Ranking the candidate terms with TermRank takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms better, is more reliable, and meets the needs of practical applications well.
Brief description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 is a schematic diagram of the formal representation of a Chinese patent title.
Detailed description of the embodiments
To make the object, technical solution and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
As shown in Fig. 1, an automatic term recognition method for Chinese patent documents comprises the following steps:
Step 1): Automatically generate part-of-speech rules from patent titles: a Chinese lexical analysis system segments each patent title into substrings and stop words; with the stop words as separators, the part-of-speech rules of the substrings are extracted and used as the part-of-speech rules for generating candidate terms;
A patent document is generally a record of an invention, a utility model or a design. Its title is a high-level summary of the whole document and therefore often states the described object directly. The title of a patent document contains at least one correct term. Based on these features of Chinese patent titles, a title is represented formally as shown in Fig. 2, where $w_i$ ($i = 1, 2, \dots, n$) denotes a word obtained by segmenting the patent title with ICTCLAS; $w_1 \dots w_a$, $w_c \dots w_d$ and $w_f \dots w_n$ are the terms in the title, denoted CT1, CT2 and CT3 respectively; $w_b$ and $w_e$ are words that do not belong to any term and are called stop words here. Their construction is described in section 3.2.
With the stop words ST1 and ST2 as separators, the part-of-speech rules of the substrings CT1, CT2 and CT3 are extracted and can then be used to generate candidate terms in the next step. For example, the patent title "one/m kind/q electric/b automobile/n of/ude1 electricity/n display/n device/n" contains the terms "electric/b automobile/n" and "electricity/n display/n". Their part-of-speech rules, "b+n" and "n+n", are extracted and added to the part-of-speech rule set, to be used for generating candidate terms in the next step.
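The extraction in step 1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the title is already segmented and tagged in `word/tag` form, the English glosses stand in for the Chinese words, and the function name `extract_pos_rules` and the stop-word set are hypothetical, not part of the patent.

```python
from collections import Counter


def extract_pos_rules(tagged_title, stop_words):
    """Split a tagged title at stop words; return one POS rule per substring."""
    rules, current = [], []
    for token in tagged_title.split():
        word, _, pos = token.rpartition("/")
        if word in stop_words:
            # A stop word closes the substring collected so far.
            if current:
                rules.append("+".join(current))
            current = []
        else:
            current.append(pos)
    if current:
        rules.append("+".join(current))
    return rules


# English glosses of the example title; the stop-word set is assumed.
title = "one/m kind/q electric/b automobile/n of/ude1 electricity/n display/n device/n"
stop_words = {"one", "kind", "of", "device"}
rules = extract_pos_rules(title, stop_words)

# Counting rules over many titles gives the frequencies used in step 3).
rule_counts = Counter(rules)
print(rules)
```

Run over a whole collection of titles, the counter accumulates how often each rule occurs, which is the input that the rule selection in step 3) needs.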
Step 2): Manually construct a stop-word list and add the stop words to it;
Stop words are an important resource for generating part-of-speech rules automatically from patent titles. The invention constructs the stop-word list manually rather than using a ready-made general stop-word list directly, because some stop words in a general list may be part of a term in patent documents. For example, "row/v" appears in general stop-word lists, but in "fully automatic/b row/v paper/n machine/ng" it is part of a term and therefore cannot be added to the stop-word list. Words like "row/v", which appear in general stop-word lists but form part of terms in Chinese patent documents, occur in large numbers in the corpus.
Stop words are selected to build the stop-word list using the following three methods:
Method one: segment the patent titles and count word frequencies; stop words whose frequency of occurrence is higher than 20 are added to the stop-word list;
Method two: parts of speech that clearly do not occur in terms are added to the stop-word list;
Method three: after the patent titles have been filtered with the stop-word list generated by method one and method two, the remaining word strings in the titles are inspected manually; any newly found stop words are also added to the stop-word list.
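Methods one and two lend themselves to a small helper that proposes stop-word candidates for manual review. The following is a minimal sketch, not the patent's procedure: the function name `stop_word_candidates` and the default set of excluded tags are assumptions, and the frequency threshold of 20 is the only value taken from the text. Method three remains a manual step.

```python
from collections import Counter


def stop_word_candidates(tagged_titles, min_freq=20,
                         excluded_pos=frozenset({"m", "q", "ude1"})):
    """Return (frequent words, words with an excluded POS) for manual review."""
    freq = Counter()
    by_pos = set()
    for title in tagged_titles:
        for token in title.split():
            word, _, pos = token.rpartition("/")
            freq[word] += 1
            if pos in excluded_pos:
                # Method two: the tag itself marks the word as a non-term word.
                by_pos.add(word)
    # Method one: words occurring more than min_freq times in the titles.
    frequent = {w for w, c in freq.items() if c > min_freq}
    return frequent, by_pos


# Toy data: 21 identical glossed titles plus one short title.
titles = ["one/m kind/q pump/n"] * 21 + ["valve/n"]
frequent, by_pos = stop_word_candidates(titles)
print(sorted(frequent), sorted(by_pos))
```

In the toy data "pump" also crosses the frequency threshold although it is a plausible term word, which illustrates why the frequent words are treated here as candidates for review rather than added to the list automatically.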
Step 3): Classify the generated part-of-speech rules by the number of parts of speech they contain. The automatically generated part-of-speech rules are numerous, and not all of them can be applied to the documents for term matching, so a subset has to be selected. In step 3), the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules. Each class of rules is then sorted in descending order of frequency of occurrence, and only the Top 5 rules of each class are applied to the body text of the Chinese patent documents for part-of-speech matching, generating the candidate term set. The extracted candidate terms are then classified by the number of words they contain into four classes: 2-word, 3-word, 4-word and 5-word candidate terms. The purpose of classifying the candidate terms is to let the terms of each length form their own candidate term table, so that ranking within a table with the TermRank algorithm of step 4) is not affected by terms of other lengths and the ranking results are fairer;
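Step 3) can be illustrated with the sketch below, which selects the most frequent rules per length and slides them over a tagged sentence. It is a simplified illustration under assumptions: the names `top_rules_by_length` and `match_candidates` are hypothetical, the input is already tagged in `word/tag` form with English glosses, and overlapping matches are all kept.

```python
from collections import Counter, defaultdict


def top_rules_by_length(rule_counts, lengths=(2, 3, 4, 5), top_k=5):
    """Group rules by number of POS tags and keep the top_k most frequent."""
    grouped = defaultdict(list)
    for rule, count in rule_counts.items():
        n = len(rule.split("+"))
        if n in lengths:
            grouped[n].append((rule, count))
    return {n: [r for r, _ in sorted(items, key=lambda rc: -rc[1])[:top_k]]
            for n, items in grouped.items()}


def match_candidates(tagged_sentence, rules_by_length):
    """Return candidate terms per length whose tag sequence matches a rule."""
    tokens = [t.rpartition("/") for t in tagged_sentence.split()]
    words = [w for w, _, _ in tokens]
    tags = [p for _, _, p in tokens]
    candidates = defaultdict(list)
    for n, rules in rules_by_length.items():
        rule_set = set(rules)
        for i in range(len(tokens) - n + 1):
            if "+".join(tags[i:i + n]) in rule_set:
                candidates[n].append(" ".join(words[i:i + n]))
    return dict(candidates)


# Hypothetical rule frequencies, as they might come out of step 1).
rule_counts = Counter({"n+n": 10, "b+n": 7, "v+n": 3, "n+n+n": 4})
top = top_rules_by_length(rule_counts)
found = match_candidates("electric/b automobile/n power/n display/n", top)
print(found)
```

Keeping the candidates in a dictionary keyed by length mirrors the separate candidate term tables described above, so each table can be ranked on its own.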
Step 4): The purpose of ranking the candidate terms is to determine the final term list. A good ranking algorithm reorders the correct and incorrect terms scattered through the candidate term list, raising the weight of correct terms so that they rank as high as possible and lowering the rank of incorrect ones. The candidate terms are ranked with the TermRank ranking algorithm, which is defined by formula (1):
$$TR(T_i)=\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-|T_i|\times count(T_i) \tag{1}$$
where $T_i$ is a candidate term and $TR(T_i)$ is its TermRank value; $M$ is the number of patent documents that contain $T_i$; $TF_{T_i}(d_j)$ is the frequency of $T_i$ in a patent document $d_j$ that contains it; $C(d_j)$ is the number of candidate terms extracted from $d_j$; $|T_i|$ is the length of $T_i$; and $count(T_i)$ is the number of stop words contained in $T_i$;
The TermRank value of each candidate term in the candidate term list is calculated according to formula (1); after ranking, the Top-N entries are taken as the final term list. Here N is set to 5, i.e. the Top-5 entries are taken as the final term list.
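Formula (1) translates almost directly into code. The sketch below is one possible reading, with its assumptions made explicit: each document is a dictionary from candidate term (a tuple of words) to its frequency, $C(d_j)$ is taken as the total number of candidate occurrences in the document, and $|T_i|$ is measured in words. The text does not settle these details, and the name `term_rank` is hypothetical.

```python
def term_rank(term, docs, stop_words):
    """TermRank per formula (1): sum of TF/C over documents minus a penalty."""
    first = 0.0
    for freqs in docs:
        tf = freqs.get(term, 0)
        if tf:
            # Assumption: C(d_j) is the total count of candidate occurrences.
            first += tf / sum(freqs.values())
    # Assumption: |T_i| is the number of words in the term.
    penalty = len(term) * sum(1 for w in term if w in stop_words)
    return first - penalty


# Two toy documents with hypothetical candidate frequencies.
docs = [
    {("electric", "automobile"): 2, ("power", "display"): 2},
    {("electric", "automobile"): 1, ("of", "device"): 3},
]
stop_words = {"of"}

candidates = {t for d in docs for t in d}
ranked = sorted(candidates, key=lambda t: term_rank(t, docs, stop_words),
                reverse=True)
print(ranked)
```

The first term rewards candidates that make up a large share of the candidates in many documents, while the second term pushes down candidates that still contain stop words, more strongly the longer they are.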
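The first and second terms of formula (1) are not necessarily of the same order of magnitude. When $M > 1000$ or $M < 2$, one of the two terms has little effect on the TermRank value of the candidate term, and the two terms then need to be normalized separately. The invention uses linear-transformation normalization; the normalization formulas for the first and second terms are formula (2) and formula (3) respectively:
$$\frac{\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}}{\max\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}} \tag{2}$$
$$\frac{|T_i|\times count(T_i)-\min|T_i|\times count(T_i)}{\max|T_i|\times count(T_i)-\min|T_i|\times count(T_i)} \tag{3}$$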
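Formulas (2) and (3) both have the form $(x-\min)/(\max-\min)$, so one helper can normalize either term across a candidate term table. The sketch below assumes the minimum and maximum are taken over all candidates in the same table and returns 0 for every value when they coincide; both choices, and the function names, are assumptions rather than details stated in the text.

```python
def min_max(values):
    """Linear-transformation normalization of a list of values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no ranking information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalized_term_rank(first_terms, second_terms):
    """Combine the two normalized terms of formula (1) per candidate."""
    return [a - b for a, b in zip(min_max(first_terms), min_max(second_terms))]


# Hypothetical first and second terms for three candidates.
first_terms = [0.5, 0.75, 1.0]
second_terms = [0, 0, 4]
scores = normalized_term_rank(first_terms, second_terms)
print(scores)
```

After normalization both terms lie in the same range, so neither can dominate the score merely because of its scale.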
9725 patent documents were processed with the method proposed by the invention. After the tables and figures were removed and the documents saved as plain text, the corpus size was 123 MB. The patent documents were segmented and part-of-speech tagged with ICTCLAS. The tagging uses the two-level part-of-speech tag set of the Institute of Computing Technology, Chinese Academy of Sciences, i.e. the "ICTPOS3.0 Chinese part-of-speech tag set". Using the construction method of step 2), the final stop-word list contains 246 stop words. Table 1 lists some of them.
Table 1: Some stop words from the manually constructed stop-word list
The experimental results were judged manually. To limit the influence of human subjectivity and of limited domain knowledge, terms that are obviously correct or incorrect were labeled directly, while candidate terms whose correctness is hard to decide were judged with the Google search engine. A candidate term is labeled as a correct term if it meets any one of the following conditions, and as an incorrect term otherwise: 1) a corresponding entry exists on a knowledge website such as Wikipedia, Baidu Baike or Hudong Baike; 2) the entry exists in a patent search system; 3) the Google search engine does not filter out any component of the candidate term or reorder its components.
Because the result set is too large to evaluate the whole ranked list, the P@N evaluation method is used: the precision of the first N entries of the final term list is assessed, i.e. the number of entries judged correct among the first N, divided by N.
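Under the usual definition of precision at N (correct entries among the first N, divided by N), the evaluation can be sketched as below. The function name `precision_at_n` and the representation of the manual judgments as a list of booleans are assumptions made for illustration.

```python
def precision_at_n(ranked_labels, n):
    """P@N: share of entries judged correct among the first n ranked entries."""
    if n <= 0:
        raise ValueError("n must be positive")
    # True counts as 1 and False as 0 when summed.
    return sum(ranked_labels[:n]) / n


# Hypothetical manual judgments for a ranked list of four candidate terms.
labels = [True, True, False, True]
print(precision_at_n(labels, 2), precision_at_n(labels, 4))
```

Evaluating the same ranked list at several values of N shows how quickly incorrect terms begin to appear further down the list.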
Using the method for automatically generating part-of-speech rules in step 1), a total of 2832 distinct part-of-speech rules were generated from the patent document titles. Table 2 lists the Top 5 after sorting by frequency. These statistics confirm from experimental data that most terms consist of nouns or compound nouns.
Table 2: Examples of the Top 5 automatically generated part-of-speech rules
Table 3 gives, for the part-of-speech rules classified by length, the percentage of the total (2832) that each class accounts for. Rules of length 4 and 5 together account for 71.5%, which confirms that terms in patent documents tend to be long.
Table 3: Proportion of part-of-speech rules of different lengths
Compared with traditional ways of generating part-of-speech rules, summarizing the rules automatically from the titles of patent documents has two advantages: 1) redundancy is greatly reduced: compared with summarizing part-of-speech rules from the patent body text, summarizing them from titles yields far fewer redundant rules; 2) dependence on the precision of the segmentation and part-of-speech tagging tools is reduced: whether a term in a title is segmented and tagged correctly or incorrectly, its part-of-speech pattern is added to the rule set, so when candidate terms are extracted, a candidate term that has been wrongly segmented and tagged is still extracted.
Because the automatically generated part-of-speech rules are numerous, it is unnecessary to apply every rule to the patent documents when extracting candidate terms. Therefore, for the part-of-speech rules of each length, only the Top 5 by frequency of occurrence are taken. Table 4 shows the Top 5 part-of-speech rules of each length.
Table 4: Top 5 part-of-speech rules of each length
Candidate terms were then extracted from the patent document body text using the part-of-speech rules listed in Table 4. This yielded 493,286 2-word candidate terms, 152,274 3-word candidate terms, 31,809 4-word candidate terms and 3,966 5-word candidate terms. Table 5 shows some of the extracted candidate terms and the part-of-speech rules they match.
The candidate terms extracted with part-of-speech rules are of fairly high quality, but some noise remains. For example, the candidate term "with reference to/v accompanying drawings/n" matches the "v+n" part-of-speech rule but is not a real term. In the candidate term "displacement/v sensor/n", the part of speech of "displacement" should be n; and for "voice/n decline/v type/k mammary gland/n somascope/n" the correct segmentation and tagging should be "voice/n type/k miniature/a mammary gland/n somascope/n". Although these word strings were segmented or tagged incorrectly, they are still terms and were recognized correctly. This is exactly the advantage of generating part-of-speech rules automatically as the invention does: the dependence on the precision of segmentation and part-of-speech tagging is small.
Table 5: Some candidate terms and the part-of-speech rules they match
The candidate terms are divided into different candidate term tables by length in words. Because only terms of 2 to 5 words are recognized here, 4 candidate term tables are obtained. The candidate terms are ranked separately within each table, to avoid an unfair overall ranking caused by one length class containing many more recognized candidates. The invention chooses TF and C-Value as comparison methods. Table 6 gives the P@N statistics of the final candidate term ranking results, where N takes the values 100, 200, 400, 800 and 1000 in turn.
Table 6: P@N evaluation of the candidate term ranking results
The experimental results in Table 6 show that the TermRank method proposed by the invention ranks candidate terms of every length clearly better than the other two ranking methods. At P@1000, the recognition precision of TermRank for 3-word terms exceeds 80%. The gradual decrease in precision from P@100 to P@1000 also confirms that TermRank is better able to distinguish terms from non-terms.
The automatic term recognition method for Chinese patent documents provided by the invention first uses a statistical method to learn automatically, from patent titles, the part-of-speech rules that form terms, which overcomes the shortcomings of manually summarized part-of-speech rules. Ranking the candidate terms with TermRank takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms better, is more reliable, and meets the needs of practical applications well.
The embodiments described above only express implementations of the invention. Their description is relatively specific and detailed, but it must not be understood as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention. Therefore, the protection scope of the patent shall be determined by the appended claims.

Claims (5)

1. An automatic term recognition method for Chinese patent documents, characterized by comprising the following steps:
Step 1): Automatically generate part-of-speech rules from patent titles: a Chinese lexical analysis system segments each patent title into substrings and stop words; with the stop words as separators, the part-of-speech rules of the substrings are extracted and used as the part-of-speech rules for generating candidate terms;
Step 2): Manually construct a stop-word list and add the stop words to it;
Step 3): Classify the generated part-of-speech rules by the number of parts of speech they contain; sort each class of rules in descending order of frequency of occurrence, and apply only the Top 5 rules of each class to the body text of the Chinese patent documents for part-of-speech matching to generate the candidate term set; then classify the extracted candidate terms by the number of words they contain;
Step 4): Rank the candidate terms with the TermRank ranking algorithm, which is defined by formula (1):
$$TR(T_i)=\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-|T_i|\times count(T_i) \tag{1}$$
where $T_i$ is a candidate term and $TR(T_i)$ is its TermRank value; $M$ is the number of patent documents that contain $T_i$; $TF_{T_i}(d_j)$ is the frequency of $T_i$ in a patent document $d_j$ that contains it; $C(d_j)$ is the number of candidate terms extracted from $d_j$; $|T_i|$ is the length of $T_i$; and $count(T_i)$ is the number of stop words contained in $T_i$;
The TermRank value of each candidate term in the candidate term list is calculated according to formula (1); after ranking, the Top-N entries are taken as the final term list.
2. The automatic term recognition method for Chinese patent documents according to claim 1, characterized in that step 2) selects stop words to build the stop-word list using the following three methods:
Method one: segment the patent titles and count word frequencies; stop words whose frequency of occurrence is higher than 20 are added to the stop-word list;
Method two: parts of speech that clearly do not occur in terms are added to the stop-word list;
Method three: after the patent titles have been filtered with the stop-word list generated by method one and method two, the remaining word strings in the titles are inspected manually; any newly found stop words are also added to the stop-word list.
3. The automatic term recognition method for Chinese patent documents according to claim 1 or 2, characterized in that, in step 3), the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules.
4. The automatic term recognition method for Chinese patent documents according to claim 1, characterized in that, in step 3), the candidate terms are divided into four classes: 2-word, 3-word, 4-word and 5-word candidate terms.
5. The automatic term recognition method for Chinese patent documents according to claim 1, characterized in that, in step 4), when the value of $M$ is relatively large or relatively small, the first and second terms of formula (1) are normalized with formula (2) and formula (3) respectively, where formulas (2) and (3) are:
$$\frac{\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}}{\max\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{TF_{T_i}(d_j)}{C(d_j)}} \tag{2}$$
$$\frac{|T_i|\times count(T_i)-\min|T_i|\times count(T_i)}{\max|T_i|\times count(T_i)-\min|T_i|\times count(T_i)} \tag{3}$$
CN201510623936.4A 2015-09-28 2015-09-28 A kind of Chinese patent document term automatic identifying method Active CN105224520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510623936.4A CN105224520B (en) 2015-09-28 2015-09-28 A kind of Chinese patent document term automatic identifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510623936.4A CN105224520B (en) 2015-09-28 2015-09-28 A kind of Chinese patent document term automatic identifying method

Publications (2)

Publication Number Publication Date
CN105224520A CN105224520A (en) 2016-01-06
CN105224520B CN105224520B (en) 2018-03-13

Family

ID=54993498

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201510623936.4A Active CN105224520B (en) 2015-09-28 2015-09-28 A kind of Chinese patent document term automatic identifying method

Country Status (1)

Country Link
CN (1) CN105224520B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090041B (en) * 2016-11-22 2021-05-18 北京国双科技有限公司 Method and device for generating advertisement creativity
CN107608949B (en) * 2017-10-16 2019-04-16 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
CN108009155A (en) * 2017-12-22 2018-05-08 联想(北京)有限公司 Data processing method and system and server
CN108280198B (en) * 2018-01-29 2021-03-02 口碑(上海)信息技术有限公司 List generation method and apparatus
CN108897736B (en) * 2018-06-20 2022-04-12 大连诺道认知医学技术有限公司 Document sorting method and device based on Paper Rank algorithm
CN109344402B (en) * 2018-09-20 2023-08-04 中国科学技术信息研究所 New term automatic discovery and identification method
CN112487801A (en) * 2020-10-23 2021-03-12 南京航空航天大学 Term recommendation method and system for safety-critical software

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197449B2 (en) * 2001-10-30 2007-03-27 Intel Corporation Method for extracting name entities and jargon terms using a suffix tree data structure
JP4816409B2 (en) * 2006-01-10 2011-11-16 日産自動車株式会社 Recognition dictionary system and updating method thereof
CN100520782C (en) * 2007-11-09 2009-07-29 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN101655866B (en) * 2009-08-14 2010-12-15 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
US9208145B2 (en) * 2012-05-07 2015-12-08 Educational Testing Service Computer-implemented systems and methods for non-monotonic recognition of phrasal terms
CN103268339B (en) * 2013-05-17 2016-06-01 中国科学院计算技术研究所 Named entity recognition method and system in Twitter message
CN104216880B (en) * 2013-05-29 2017-06-16 北京信息科技大学 Term based on internet defines discrimination method
CN103678656A (en) * 2013-12-23 2014-03-26 合肥工业大学 Unsupervised automatic extraction method of microblog new words based on repeated word strings
CN103778243B (en) * 2014-02-11 2017-02-08 北京信息科技大学 Domain term extraction method
CN103885934B (en) * 2014-02-19 2017-05-03 中国专利信息中心 Method for automatically extracting key phrases of patent documents

Also Published As

Publication number Publication date
CN105224520A (en) 2016-01-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant