CN105224520B - Method for automatically identifying terms in Chinese patent documents
- Publication number: CN105224520B
- Application number: CN201510623936.4A
- Authority: CN (China)
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The present invention relates to a method for automatically identifying terms in Chinese patent documents, comprising the following steps. Step 1): automatically generate part-of-speech rules from patent titles. Step 2): manually construct a stop word list. Step 3): classify the generated part-of-speech rules by the number of parts of speech they contain. Step 4): rank the candidate terms with the TermRank ranking algorithm. The invention first uses a statistical method to learn, from patent titles, the part-of-speech rules that make up terms, which overcomes the shortcomings of summarizing such rules by hand. It then ranks the candidate terms with TermRank, which takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms more effectively, is more reliable, and meets the needs of practical application.
Description
Technical field
The invention belongs to the technical field of automatic recognition of Chinese terminology, and specifically relates to a method for automatically identifying terms in Chinese patent documents.
Background
Chinese patent documents contain a large number of domain terms, and identifying these terms automatically is an important task in fields such as information extraction and text mining. Automatic Term Recognition (ATR) is an important part of information extraction research. It refers to the process of automatically identifying, from free text and with no or as little manual intervention as possible, the word strings that represent general concepts of a particular professional domain. A term bank built with automatic term recognition is a very important basic data resource: it provides indispensable data support for Chinese word segmentation, ontology construction, dictionary compilation and updating, automatic indexing, information retrieval, machine translation and so on. In addition, with the rapid development of information technology, digital information resources are growing day by day, and automatically recognizing terms in these resources is of great significance for keeping track of the latest developments and future trends of a field.
Chinese patent documents are an important digital information resource. They record the latest research results of every discipline and contain a large number of technical terms. Observation of Chinese patent documents, combined with earlier research, shows that terms in patent documents have the following distinguishing characteristics: (1) nested terms are relatively common; (2) terms are strongly domain-specific, that is, a term that occurs frequently in one domain occurs rarely or not at all in other domains; (3) terms recur, that is, a term appears in many documents of the whole patent collection; (4) patent terms are relatively long, generally consisting of 2 to 5 words; (5) patent terms are mostly composed of nouns or compound nouns. These characteristics are an important basis for automatic term recognition in Chinese patent documents.
At present there are two main kinds of automatic term recognition methods:
The first is the traditional approach that combines rules with statistics. To generate the candidate term set, the Chinese text is first segmented and part-of-speech tagged, a set of part-of-speech rules that make up terms is summarized by observing the tagged corpus, and these rules are then matched against the corpus to produce the candidate term set. Writing part-of-speech rules by hand gives relatively high recognition precision, but it depends too heavily on the linguistic knowledge of the rule writer, and different people write inconsistent rules for the same corpus. Other methods do not need part-of-speech rules at the candidate generation stage, but they depend too heavily on external resources when coarsely segmenting sentences, and the quality of those resources often determines the quality of the resulting candidate term set. As for ranking the candidate term set, the ranking algorithms in common use perform poorly on short terms and on terms with a low frequency of occurrence;
The second uses machine learning algorithms, which have become a research focus in information extraction in recent years. Their drawback is that they place high demands on the size and quality of the training corpus, require a large amount of manually annotated data, and take a long time to train.
In addition, the mainstream candidate term ranking algorithms recognize short terms poorly.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a method for automatically identifying terms in Chinese patent documents that avoids the above technical defects.
To achieve this object, the present invention adopts the following technical solution:
A method for automatically identifying terms in Chinese patent documents, comprising the following steps:
Step 1): Automatically generate part-of-speech rules from patent titles: use a Chinese lexical analysis system to segment each patent title into substrings and stop words, use the stop words as separators, extract the part-of-speech rule of each substring, and use it as a part-of-speech rule for generating candidate terms;
Step 2): Manually construct a stop word list and add stop words to it;
Step 3): Classify the generated part-of-speech rules by the number of parts of speech they contain, sort each class of rules in descending order of frequency of occurrence, apply only the Top 5 rules of each class to the body text of the Chinese patent documents for part-of-speech matching to generate the candidate term set, and then classify the extracted candidate terms by the number of words they contain;
Step 4): Rank the candidate terms with the TermRank ranking algorithm, which is defined by formula (1):
$$TR(T_i) = \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - |T_i| \times \mathrm{count}(T_i) \qquad (1)$$
where $T_i$ is a candidate term and $TR(T_i)$ is its TermRank value; $M$ is the number of patent documents that contain $T_i$; $\mathrm{TF}_{T_i}(d_j)$ is the frequency of $T_i$ in a patent document $d_j$ that contains it; $C(d_j)$ is the number of candidate terms extracted from $d_j$; $|T_i|$ is the length of $T_i$; and $\mathrm{count}(T_i)$ is the number of stop words contained in $T_i$;
The TermRank value of every candidate term in the candidate term list is computed according to formula (1); after sorting, the Top-N entries are taken as the final term list.
Further, step 2) selects stop words and builds the stop word list with the following three methods:
Method one: segment the patent titles and count word frequencies, then add the stop words whose frequency is higher than 20 to the stop word list;
Method two: add the parts of speech that clearly do not occur in terms to the stop word list;
Method three: after filtering the patent titles with the stop word list produced by methods one and two, manually inspect the remaining word strings in the titles, and add any newly found stop words to the stop word list.
Further, in step 3), the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules.
Further, in step 3), the candidate terms are divided into four classes: 2-word, 3-word, 4-word and 5-word candidate terms.
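Further, in step 4), when the value of M is large or small, the first term and the second term of formula (1) are normalized with formula (2) and formula (3) respectively:
$$\frac{\sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - \min \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}}{\max \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - \min \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}} \qquad (2)$$
$$\frac{|T_i| \times \mathrm{count}(T_i) - \min\bigl(|T_i| \times \mathrm{count}(T_i)\bigr)}{\max\bigl(|T_i| \times \mathrm{count}(T_i)\bigr) - \min\bigl(|T_i| \times \mathrm{count}(T_i)\bigr)} \qquad (3)$$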
The method for automatically identifying terms in Chinese patent documents provided by the invention first uses a statistical method to learn, from patent titles, the part-of-speech rules that make up terms, which overcomes the shortcomings of summarizing such rules by hand. It then ranks the candidate terms with the TermRank ranking method, which takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms more effectively, is more reliable, and meets the needs of practical application.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the formal representation of a Chinese patent title.
Embodiments
To make the purpose, technical solution and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
As shown in Fig. 1, a method for automatically identifying terms in Chinese patent documents comprises the following steps:
Step 1): Automatically generate part-of-speech rules from patent titles: use a Chinese lexical analysis system to segment each patent title into substrings and stop words, use the stop words as separators, extract the part-of-speech rule of each substring, and use it as a part-of-speech rule for generating candidate terms;
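Step 1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: it assumes the title has already been segmented and tagged (for example by ICTCLAS) into `(word, tag)` pairs, and the function name `extract_pos_rules`, the English glosses of the title words and the stop word set are all hypothetical.

```python
def extract_pos_rules(tagged_title, stop_words):
    """Split a tagged title at stop words; return one POS rule per substring."""
    rules, current = [], []
    for word, pos in tagged_title:
        if word in stop_words:
            # A stop word closes the current substring, if there is one.
            if current:
                rules.append("+".join(current))
            current = []
        else:
            current.append(pos)
    if current:  # substring that runs to the end of the title
        rules.append("+".join(current))
    return rules


# Illustrative title in word/tag form (English glosses, assumed stop words).
title = [("one", "m"), ("kind", "q"), ("electric", "b"), ("automobile", "n"),
         ("of", "ude1"), ("electricity", "n"), ("display", "n"), ("device", "n")]
stop_words = {"one", "kind", "of", "device"}
print(extract_pos_rules(title, stop_words))
```

Counting the rules returned for every title in the corpus (for instance with `collections.Counter`) would give the rule frequencies that step 3) sorts on.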
A patent document is usually the record of an invention, a utility model or a design. Its title is a highly condensed summary of the whole document and therefore often directly names the object being described. The title of a patent document contains at least one correct term. Based on these features of Chinese patent titles, a title is formally represented as shown in Fig. 2, where $w_i$ ($i = 1, 2, \ldots, n$) denotes the words into which ICTCLAS segments the title; $w_1 \ldots w_a$, $w_c \ldots w_d$ and $w_f \ldots w_n$ are the terms in the title, denoted CT1, CT2 and CT3 respectively; and $w_b$ and $w_e$ are words that do not belong to any term, called stop words here and denoted ST1 and ST2. How the stop word list is built is described in step 2).
Using the stop words ST1 and ST2 as separators, the part-of-speech rules of the substrings CT1, CT2 and CT3 are extracted and serve as the part-of-speech rules for generating candidate terms in the next step. For example, the patent title "one/m kind/q electric/b automobile/n of/ude1 electricity/n display/n device/n" contains the terms "electric/b automobile/n" and "electricity/n display/n". Their part-of-speech rules, "b+n" and "n+n", are extracted and added to the part-of-speech rule set, which is used to generate candidate terms in the next step.
Step 2): Manually construct a stop word list and add stop words to it;
Stop words are an important resource for automatically generating part-of-speech rules from patent titles. The invention builds the stop word list by hand instead of using a ready-made general-purpose list, because some of the words in a general-purpose list may be part of a term in patent documents. For example, "row/v" is in general-purpose stop word lists, but in "fully-automatic/b row/v paper/n machine/ng" it is part of a term and therefore must not be added to the stop word list. Words like "row/v", which appear in general-purpose stop word lists but are part of terms in Chinese patent documents, are abundant in the corpus.
Stop words are selected and the stop word list is built with the following three methods:
Method one: segment the patent titles and count word frequencies, then add the stop words whose frequency is higher than 20 to the stop word list;
Method two: add the parts of speech that clearly do not occur in terms to the stop word list;
Method three: after filtering the patent titles with the stop word list produced by methods one and two, manually inspect the remaining word strings in the titles, and add any newly found stop words to the stop word list.
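Methods one and two lend themselves to a small helper that proposes stop word candidates for the human reviewer; method three remains a manual step. The sketch below is an assumption-laden illustration: the function name `stop_word_candidates`, the `(word, tag)` input format and the example tag set passed as `excluded_pos` are hypothetical, and the source does not say which parts of speech it excludes.

```python
from collections import Counter


def stop_word_candidates(tagged_titles, min_freq=20, excluded_pos=frozenset()):
    """Return (frequency-based, POS-based) stop word candidates from tagged titles."""
    # Method one: words that occur more than `min_freq` times across all titles.
    freq = Counter(word for title in tagged_titles for word, _ in title)
    by_freq = {word for word, count in freq.items() if count > min_freq}
    # Method two: words whose part of speech is assumed never to occur in terms.
    by_pos = {word for title in tagged_titles
              for word, pos in title if pos in excluded_pos}
    return by_freq, by_pos


# Toy corpus: "device" and "of" occur 21 times, "sensor" once.
titles = [[("device", "n"), ("of", "ude1")] for _ in range(21)] + [[("sensor", "n")]]
by_freq, by_pos = stop_word_candidates(titles, excluded_pos={"ude1"})
print(sorted(by_freq), sorted(by_pos))
```

The frequency-based set is only a list of candidates: as the "row/v" example shows, a frequent word can still be part of a term, so a reviewer would decide which candidates actually enter the stop word list.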
Step 3): Classify the generated part-of-speech rules by the number of parts of speech they contain. The automatically generated rules are numerous, and not all of them can be applied to the documents for term matching, so a subset has to be selected. In step 3) the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules. Each class is then sorted in descending order of frequency of occurrence, and only the Top 5 rules of each class are applied to the body text of the Chinese patent documents for part-of-speech matching, which generates the candidate term set. The extracted candidate terms are then classified by the number of words they contain into four classes: 2-word, 3-word, 4-word and 5-word candidate terms. The purpose of this classification is to give the terms of each length their own candidate term table, so that ranking with the TermRank algorithm of step 4) is not affected by terms of other lengths and the ranking results are fairer;
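The rule selection and matching of step 3) can be sketched as follows. This is a minimal illustration under stated assumptions: rules are strings such as `"b+n"` with their frequencies in a dictionary, sentences are lists of `(word, tag)` pairs, matching is a plain sliding window over the tag sequence, and the names `top_rules_by_length` and `match_candidates` are hypothetical.

```python
from collections import defaultdict


def top_rules_by_length(rule_counts, k=5, lengths=(2, 3, 4, 5)):
    """Group rules by number of POS tags and keep the k most frequent per group."""
    grouped = defaultdict(list)
    for rule, count in rule_counts.items():
        n = len(rule.split("+"))
        if n in lengths:
            grouped[n].append((rule, count))
    return {n: [rule for rule, _ in sorted(items, key=lambda rc: -rc[1])[:k]]
            for n, items in grouped.items()}


def match_candidates(tagged_sentence, rules_by_length):
    """Slide a window of each rule length over the sentence and collect matches."""
    found = defaultdict(list)
    for n, rules in rules_by_length.items():
        rule_set = set(rules)
        for i in range(len(tagged_sentence) - n + 1):
            window = tagged_sentence[i:i + n]
            if "+".join(pos for _, pos in window) in rule_set:
                found[n].append(tuple(word for word, _ in window))
    return dict(found)  # one candidate list per term length


rule_counts = {"n+n": 10, "b+n": 7, "v+n": 5, "a+n": 4, "n+v": 3, "d+v": 1, "n+n+n": 4}
rules = top_rules_by_length(rule_counts)
sentence = [("electric", "b"), ("automobile", "n"), ("display", "n")]
print(match_candidates(sentence, rules))
```

Keeping the matches in a dictionary keyed by length mirrors the four separate candidate term tables, so each table can later be ranked on its own.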
Step 4): The purpose of ranking the candidate terms is to determine the final term list. A good ranking algorithm reorders the correct and incorrect terms scattered through the candidate term list, raising the weight of correct terms so that they rank as high as possible and lowering the weight of incorrect terms so that they rank as low as possible. The candidate terms are ranked with the TermRank ranking algorithm, which is defined by formula (1):
$$TR(T_i) = \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - |T_i| \times \mathrm{count}(T_i) \qquad (1)$$
where $T_i$ is a candidate term and $TR(T_i)$ is its TermRank value; $M$ is the number of patent documents that contain $T_i$; $\mathrm{TF}_{T_i}(d_j)$ is the frequency of $T_i$ in a patent document $d_j$ that contains it; $C(d_j)$ is the number of candidate terms extracted from $d_j$; $|T_i|$ is the length of $T_i$; and $\mathrm{count}(T_i)$ is the number of stop words contained in $T_i$;
The TermRank value of every candidate term in the candidate term list is computed according to formula (1); after sorting, the Top-N entries are taken as the final term list. Here N is 5, that is, the Top-5 entries are taken as the final term list.
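A direct reading of formula (1), as given in claim 1, is the sum over the documents containing the term of TF divided by C, minus the term length multiplied by its stop word count. The sketch below follows that reading and rests on two assumptions that the source does not spell out: C(d_j) is taken as the total number of candidate term occurrences extracted from document d_j, and the term length is counted in words. The names `term_rank` and `rank_terms` and the data layout (one `{term: frequency}` dictionary per document, terms as tuples of words) are hypothetical.

```python
def term_rank(term, docs, stop_words):
    """TermRank of `term` per formula (1); docs is a list of {term: frequency} dicts."""
    # First term: relative frequency of the term, summed over documents containing it.
    first = sum(freqs[term] / sum(freqs.values()) for freqs in docs if term in freqs)
    # Second term: length in words times number of stop words inside the term.
    second = len(term) * sum(1 for word in term if word in stop_words)
    return first - second


def rank_terms(docs, stop_words, top_n):
    """Sort all candidate terms by TermRank (descending) and keep the top N."""
    terms = {term for freqs in docs for term in freqs}
    ranked = sorted(terms, key=lambda t: term_rank(t, docs, stop_words), reverse=True)
    return ranked[:top_n]


docs = [
    {("electric", "automobile"): 2, ("with", "drawing"): 2},
    {("electric", "automobile"): 1},
]
print(rank_terms(docs, stop_words={"with"}, top_n=1))
```

In this toy corpus the candidate containing a stop word is pushed down by the second term of the formula, which is the penalty the formula applies to strings that still contain stop words.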
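The first term and the second term of formula (1) are not necessarily of the same order of magnitude. When M > 1000 or M < 2, one of the two terms has little influence on the TermRank value of a candidate term, so each term needs to be normalized separately. The invention uses linear (min-max) normalization; the normalization formulas for the first and second terms are formula (2) and formula (3) respectively:
$$\frac{\sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - \min \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}}{\max \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - \min \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}} \qquad (2)$$
$$\frac{|T_i| \times \mathrm{count}(T_i) - \min\bigl(|T_i| \times \mathrm{count}(T_i)\bigr)}{\max\bigl(|T_i| \times \mathrm{count}(T_i)\bigr) - \min\bigl(|T_i| \times \mathrm{count}(T_i)\bigr)} \qquad (3)$$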
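The linear normalization of formulas (2) and (3) is ordinary min-max scaling applied to each term of formula (1) separately. The sketch below assumes that the minimum and maximum are taken over all candidate terms in the same candidate table, which the source does not state explicitly; mapping every value to 0.0 when all values are equal is likewise an added assumption to avoid division by zero. The function names are hypothetical.

```python
def min_max(values):
    """Linearly rescale values to [0, 1]; all-equal input maps to 0.0 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def normalized_term_rank(first_terms, second_terms):
    """Normalize both terms of formula (1) separately, then take their difference."""
    return [a - b for a, b in zip(min_max(first_terms), min_max(second_terms))]


# One entry per candidate term: summed TF/C values and length-times-stop-word products.
print(normalized_term_rank([1.0, 3.0, 5.0], [0.0, 4.0, 2.0]))
```

After normalization both terms lie in the same range, so neither one dominates the score merely because of its scale.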
The proposed method was applied to 9725 patent documents. After tables and pictures were removed and the documents were saved as plain text, the corpus size was 123M. The patent documents were segmented and part-of-speech tagged with ICTCLAS. Part-of-speech tagging used the second-level part-of-speech tag set of the Institute of Computing Technology, Chinese Academy of Sciences, namely the "ICTPOS3.0 Chinese part-of-speech tag set". With the stop word list construction method of step 2), the final stop word list contains 246 stop words in total. Table 1 lists some of them.
Table 1: Some of the stop words in the manually constructed stop word list
The experimental results were judged manually. To limit the effect of human subjectivity and of limited domain knowledge, terms that were obviously correct or incorrect were labeled directly, while candidate terms whose correctness was hard to judge were checked with the Google search engine. A candidate term was labeled as a correct term if it met any one of the following conditions, and as an incorrect term otherwise: 1) a corresponding entry exists on a knowledge website such as Wikipedia, Baidu Baike or Hudong Baike; 2) the entry exists in a patent search system; 3) the Google search engine does not filter out or reorder any component of the candidate term.
Because the result set is too large to evaluate the whole ranked list, the P@N evaluation method is used, that is, the precision of the top N entries of the final term list is measured. It is calculated as follows:
$$P@N = \frac{\text{number of correct terms among the top } N \text{ entries}}{N}$$
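P@N is the share of correct terms among the first N entries of the ranked list. A minimal sketch, assuming the ranked list and the set of manually confirmed terms are already available (the function name `precision_at_n` and the toy data are hypothetical):

```python
def precision_at_n(ranked_terms, correct_terms, n):
    """Fraction of the top-n ranked terms that were judged correct."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    top = ranked_terms[:n]
    return sum(1 for term in top if term in correct_terms) / n


ranked = ["displacement sensor", "with reference to drawings",
          "electric automobile", "such device"]
correct = {"displacement sensor", "electric automobile"}
print(precision_at_n(ranked, correct, 2), precision_at_n(ranked, correct, 4))
```

Dividing by N rather than by the number of entries actually available means a list shorter than N is penalized, which matches the usual definition of precision at a fixed cutoff.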
Using the automatic part-of-speech rule generation method of step 1), a total of 2832 distinct part-of-speech rules were generated from the patent document titles. Table 2 lists the Top 5 after sorting by frequency. These statistics confirm on the experimental data that most terms are composed of nouns or compound nouns.
Table 2: Top 5 automatically generated part-of-speech rules
Table 3 gives, for the part-of-speech rules classified by length, the percentage of the total (2832) accounted for by each class. Rules of length 4 and 5 together account for 71.5%, which confirms that terms in patent documents tend to be long.
Table 3: Proportion of part-of-speech rules of different lengths
Compared with traditional ways of generating part-of-speech rules, automatically summarizing them from the titles of patent documents has two advantages: 1) redundancy is greatly reduced: compared with summarizing rules from the patent body text, summarizing them from titles produces far fewer redundant rules; 2) dependence on the precision of the word segmentation and part-of-speech tagging tool is reduced: whether a term in a title is segmented and tagged correctly or incorrectly, its part-of-speech pattern is added to the rule set, so when candidate terms are extracted, a candidate term that has been segmented or tagged incorrectly is still extracted.
Because so many part-of-speech rules are generated automatically, it is unnecessary to apply all of them to the patent documents to extract candidate terms. Therefore, for each rule length, only the Top 5 rules by frequency of occurrence are kept. Table 4 lists the Top 5 part-of-speech rules of each length.
Table 4: Top 5 part-of-speech rules of each length
The part-of-speech rules listed in Table 4 were then applied to the body text of the patent documents. This extracted 493286 2-word candidate terms, 152274 3-word candidate terms, 31809 4-word candidate terms and 3966 5-word candidate terms. Table 5 lists some of the extracted candidate terms together with the part-of-speech rules they matched.
The candidate terms extracted with the part-of-speech rules are of fairly high quality, but some noise remains. For example, the candidate term "with reference to/v accompanying drawings/n" matches the "v+n" part-of-speech rule but is not itself a real term. In the candidate term "displacement/v sensor/n", the part of speech of "displacement" should be n, and the correct segmentation and tagging of "voice/n decline/v type/k mammary gland/n somascope/n" should be "voice/n type/k miniature/a mammary gland/n somascope/n". Although these word strings were segmented or tagged incorrectly, they are still terms and were still identified. This is precisely the advantage of the automatically generated part-of-speech rules of the invention: they depend less on the precision of word segmentation and part-of-speech tagging.
Table 5: Some candidate terms and the part-of-speech rules they matched
The candidate terms are divided into separate candidate term tables by length in words. Because only terms of 2 to 5 words are identified here, there are 4 candidate term tables. Ranking is carried out separately on each table so that a large number of candidates of one length does not distort the overall ranking. The invention uses the TF and C-Value methods for comparison. Table 6 gives the P@N evaluation of the final candidate term rankings, with N taking the values 100, 200, 400, 800 and 1000 in turn.
Table 6: P@N evaluation of the candidate term ranking results
The experimental results in Table 6 show that the proposed TermRank method ranks candidate terms of every length clearly better than the other two ranking methods. At P@1000, the recognition precision of TermRank for 3-word terms exceeds 80%. The gradual decrease in precision from P@100 to P@1000 also confirms that TermRank distinguishes terms from non-terms well.
The method for automatically identifying terms in Chinese patent documents provided by the invention first uses a statistical method to learn, from patent titles, the part-of-speech rules that make up terms, which overcomes the shortcomings of summarizing such rules by hand. It then ranks the candidate terms with the TermRank ranking method, which takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms more effectively, is more reliable, and meets the needs of practical application.
The embodiments described above only express implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of the invention. Therefore, the scope of protection of this patent shall be determined by the appended claims.
Claims (5)
1. A method for automatically identifying terms in Chinese patent documents, characterized in that it comprises the following steps:
Step 1): Automatically generate part-of-speech rules from patent titles: use a Chinese lexical analysis system to segment each patent title into substrings and stop words, use the stop words as separators, extract the part-of-speech rule of each substring, and use it as a part-of-speech rule for generating candidate terms;
Step 2): Manually construct a stop word list and add stop words to it;
Step 3): Classify the generated part-of-speech rules by the number of parts of speech they contain, sort each class of rules in descending order of frequency of occurrence, apply only the Top 5 rules of each class to the body text of the Chinese patent documents for part-of-speech matching to generate the candidate term set, and then classify the extracted candidate terms by the number of words they contain;
Step 4): Rank the candidate terms with the TermRank ranking algorithm, which is defined by formula (1):
$$TR(T_i) = \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - |T_i| \times \mathrm{count}(T_i) \qquad (1)$$
where $T_i$ is a candidate term and $TR(T_i)$ is its TermRank value; $M$ is the number of patent documents that contain $T_i$; $\mathrm{TF}_{T_i}(d_j)$ is the frequency of $T_i$ in a patent document $d_j$ that contains it; $C(d_j)$ is the number of candidate terms extracted from $d_j$; $|T_i|$ is the length of $T_i$; and $\mathrm{count}(T_i)$ is the number of stop words contained in $T_i$;
The TermRank value of every candidate term in the candidate term list is computed according to formula (1); after sorting, the Top-N entries are taken as the final term list.
2. The method for automatically identifying terms in Chinese patent documents according to claim 1, characterized in that step 2) selects stop words and builds the stop word list with the following three methods:
Method one: segment the patent titles and count word frequencies, then add the stop words whose frequency is higher than 20 to the stop word list;
Method two: add the parts of speech that clearly do not occur in terms to the stop word list;
Method three: after filtering the patent titles with the stop word list produced by methods one and two, manually inspect the remaining word strings in the titles, and add any newly found stop words to the stop word list.
3. The method for automatically identifying terms in Chinese patent documents according to claim 1 or 2, characterized in that, in step 3), the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules.
4. The method for automatically identifying terms in Chinese patent documents according to claim 1, characterized in that, in step 3), the candidate terms are divided into four classes: 2-word, 3-word, 4-word and 5-word candidate terms.
5. The method for automatically identifying terms in Chinese patent documents according to claim 1, characterized in that, in step 4), when the value of M is large or small, the first term and the second term of formula (1) are normalized with formula (2) and formula (3) respectively:
$$\frac{\sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - \min \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}}{\max \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)} - \min \sum_{j=1}^{M} \frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}} \qquad (2)$$
$$\frac{|T_i| \times \mathrm{count}(T_i) - \min\bigl(|T_i| \times \mathrm{count}(T_i)\bigr)}{\max\bigl(|T_i| \times \mathrm{count}(T_i)\bigr) - \min\bigl(|T_i| \times \mathrm{count}(T_i)\bigr)} \qquad (3)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510623936.4A CN105224520B (en) | 2015-09-28 | 2015-09-28 | A kind of Chinese patent document term automatic identifying method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510623936.4A CN105224520B (en) | 2015-09-28 | 2015-09-28 | A kind of Chinese patent document term automatic identifying method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105224520A CN105224520A (en) | 2016-01-06 |
CN105224520B true CN105224520B (en) | 2018-03-13 |
Family ID: 54993498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510623936.4A Active CN105224520B (en) | 2015-09-28 | 2015-09-28 | A kind of Chinese patent document term automatic identifying method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105224520B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN105224520A (en) | 2016-01-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105224520B (en) | A kind of Chinese patent document term automatic identifying method | |
CN111177365B (en) | Unsupervised automatic abstract extraction method based on graph model | |
Tiedemann et al. | Efficient discrimination between closely related languages | |
CN106294320B (en) | A kind of terminology extraction method and system towards academic paper | |
Butnaru et al. | Moroco: The moldavian and romanian dialectal corpus | |
CN107992633A (en) | Electronic document automatic classification method and system based on keyword feature | |
CN107463607A (en) | The domain entities hyponymy of bluebeard compound vector sum bootstrapping study obtains and method for organizing | |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium | |
US7962507B2 (en) | Web content mining of pair-based data | |
CN108509409A (en) | A method of automatically generating semantic similarity sentence sample | |
CN105095196B (en) | The method and apparatus of new word discovery in text | |
CN109885675B (en) | Text subtopic discovery method based on improved LDA | |
CN108681574A (en) | A kind of non-true class quiz answers selection method and system based on text snippet | |
CN108038099B (en) | Low-frequency keyword identification method based on word clustering | |
CN110472203B (en) | Article duplicate checking and detecting method, device, equipment and storage medium | |
CN106055560A (en) | Method for collecting data of word segmentation dictionary based on statistical machine learning method | |
CN111191022A (en) | Method and device for generating short titles of commodities | |
CN110399606A (en) | A kind of unsupervised electric power document subject matter generation method and system | |
CN107329960A (en) | Unregistered word translating equipment and method in a kind of neural network machine translation of context-sensitive | |
CN107526841A (en) | A kind of Tibetan language text summarization generation method based on Web | |
CN106708798A (en) | String segmentation method and device | |
CN110705247A (en) | Based on x2-C text similarity calculation method | |
CN109062895A (en) | A kind of intelligent semantic processing method | |
CN112966117A (en) | Entity linking method | |
Regina et al. | Clickbait headline detection using supervised learning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||