CN108920456A - Automatic keyword extraction method - Google Patents

Automatic keyword extraction method

Info

Publication number
CN108920456A
Authority
CN
China
Prior art keywords
word
candidate keywords
keyword
technical standard
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810611476.7A
Other languages
Chinese (zh)
Other versions
CN108920456B (en)
Inventor
吕学强
董志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201810611476.7A priority Critical patent/CN108920456B/en
Publication of CN108920456A publication Critical patent/CN108920456A/en
Application granted granted Critical
Publication of CN108920456B publication Critical patent/CN108920456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an automatic keyword extraction method, including: extracting general words from technical standards; extracting candidate keywords; filtering the general words out of the candidate keywords; computing a weighted score for each candidate keyword by combining a position feature, a word co-occurrence feature, and a contextual semantic feature; computing a dynamic threshold from the range of the candidate keyword weighted scores; and determining the result keywords with the dynamic threshold. The automatic keyword extraction method provided by the invention fuses the position feature, word co-occurrence feature, and contextual semantic feature to extract keywords, jointly accounting for the influence of in-document position and contextual semantics on keyword weights. It achieves high precision and recall, improves the retrieval quality of 3GPP technical standards, reduces labor cost, and meets the needs of practical applications well.

Description

Automatic keyword extraction method
Technical field
The invention belongs to the technical field of automatic keyword extraction, and in particular relates to an automatic keyword extraction method for 3GPP technical standards.
Background technique
The flourishing of mobile communication technology has brought epoch-making change to human society. As the setter of cutting-edge norms in the communications field, the 3rd Generation Partnership Project (3GPP) is dedicated to promoting 3G standards based on the evolved GSM core network, including WCDMA, TD-SCDMA, and EDGE.
In recent years, patent infringement disputes between major communication technology companies have become commonplace, and the stability of invention patent rights faces unprecedented challenges. 3GPP technical standards play an irreplaceable role in communication patent examination.
3GPP technical standards are a class of scientific and technical non-patent literature specific to patent examination in the communications field; they are usually used as comparison documents to measure the inventiveness and novelty of communications patent applications.
A typical 3GPP technical standard contains the standard number, publication number, document title, and version information on the cover; the Foreword explains the version number; the Scope section states the scope of application; the References section gives the bibliography; the Definitions and Abbreviations sections list the document's important definitions and abbreviations; the topic body introduces the technical background and details in chapters; and the Annex mainly covers version change history.
In addition, 3GPP technical standards cite one another, and they differ from patent documents as shown in Table 1.
Table 1. Differences between patent documents and 3GPP technical standards
As Table 1 shows, 3GPP technical standards have their own organizational structure and document types. Practical patent examination is primarily concerned with technical specifications (Technical Specification, TS), technical reports (Technical Report, TR), and committee documents. Technical specifications and technical reports describe technology-related regulations, principles, simulations, and experimental results, while committee documents mainly record the meeting information of each working group. By comparison, technical specifications and technical reports are close in content and format, and the core technical information they carry is richer and of greater mining value.
In practical patent examination, retrieval of 3GPP technical standards is mainly driven by keywords chosen by the examiner by hand. The quality of the search results tends to depend on the quality of the keywords, and this traditional approach is not only time-consuming and laborious but also makes it hard to guarantee the hit rate of comparison documents. Compared with patent documents, 3GPP technical standards have broad coverage, large volume, irregular formatting, and weak readability, and these characteristics make automatic keyword extraction from them considerably harder than from patent documents. Improving automatic keyword extraction for 3GPP technical standards therefore not only helps raise the efficiency of communication patent examination but is also significant for maintaining the stability of patent rights.
Automatic keyword extraction has been studied extensively at home and abroad, generally along two branches: supervised learning and unsupervised learning. Supervised methods usually cast keyword extraction as a binary or multi-class classification problem in machine learning, mainly involving classifiers such as Naive Bayes, Maximum Entropy, and Support Vector Machines. Although such methods predict fairly well to a certain degree, their extraction quality often depends on the annotation quality and scale of the training corpus; they cannot avoid heavy human input and have difficulty adapting to the massive data of practical applications. The clearest advantage of unsupervised methods over supervised ones is that they greatly save human cost; by algorithmic idea they can be divided into statistics-based methods, topic-model-based methods, and word-graph-based methods. Statistics-based methods generally measure candidate keyword weights with indicators such as term frequency, term frequency-inverse document frequency (TF-IDF), and the χ² statistic; being frequency-sensitive, they easily miss important low-frequency words. The classic topic-model-based method is the LDA (Latent Dirichlet Allocation) algorithm, which infers the "document-topic" and "topic-term" probability distributions from the known "document-term" matrix of a training corpus; its extraction quality depends on the topic distribution of the training set itself. Among word-graph methods, TextRank is the most widely used. Inspired by Google's PageRank, TextRank builds a graph whose nodes are the sentences or words of a text, weights the edges by the similarity between nodes, and ranks node importance with an iterative voting mechanism. Although this method does not depend on corpus size, its limitation is that it considers only the information inside a text and ignores the distribution of vocabulary across different texts. Current mainstream approaches generally fuse the advantages of different methods for a particular problem; their remaining defects include insufficient consideration of semantic features, poor recognition quality overall, and poor recognition of low-frequency keywords.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide an automatic keyword extraction method that avoids the above technical defects.
To achieve this object, the technical solution provided by the invention is as follows:
An automatic keyword extraction method, including: extracting general words from technical standards; extracting candidate keywords; filtering the general words out of the candidate keywords; computing candidate keyword weighted scores by combining a position feature, a word co-occurrence feature, and a contextual semantic feature; computing a dynamic threshold from the range of candidate keyword weighted scores; and determining the result keywords with the dynamic threshold.
Further, the automatic keyword extraction method includes:
Step 1) remove the text noise in the 3GPP technical standards;
Step 2) extract the general words in the technical standards;
Step 3) extract candidate keywords based on the parse tree and filter out general words;
Step 4) jointly consider the position feature, word co-occurrence feature, and contextual semantic feature of each candidate keyword within the document, compute and rank weighted scores, then compute a dynamic threshold from the actual score range of the technical standard and add the candidate keywords whose scores exceed the threshold to the result keyword set.
Further, step 1) specifically is: parsing the technical standards with Apache POI to remove the text noise in the 3GPP technical standards.
Further, the text noise includes pictures, tables, formulas, special characters, and illegal characters.
Further, step 2) includes: extracting the general words in the technical standards based on word frequency-document distribution entropy, which measures the uncertainty of the distribution of a word w over the set of technical standards. Let the document set of n technical standards be D = {d_1, d_2, …, d_n} and denote the word frequency-document distribution entropy of word w by H(w); then

H(w) = -Σ_{i=1}^{n} P(w, d_i) · log P(w, d_i),

where P(w, d_i) is the probability that word w appears in technical standard d_i, 1 ≤ i ≤ n. By maximum likelihood estimation,

P(w, d_i) = f(w, d_i) / Σ_{j=1}^{n} f(w, d_j),

where f(w, d_i) is the number of occurrences of word w in technical standard d_i.
Further, extracting the candidate keywords based on the dependency parse tree includes:
Step1: traverse the technical standard set D; split each technical standard d_i in D into sentences at punctuation and denote the resulting sentence set as Sentences(d_i) = {s_1, s_2, …, s_{n_s}}, where n_s is the number of sentences in document d_i;
Step2: run dependency parsing on each sentence in Sentences(d_i) with the Stanford Parser to obtain the corresponding set of parse trees Trees(d_i) = {T_1, T_2, …, T_{n_s}}, where T_i denotes the parse tree of the i-th sentence of technical standard d_i;
Step3: read the parse tree set Trees(d_i) in a loop. For any parse tree T_i ∈ Trees(d_i), treat each word together with its part of speech as one leaf node and traverse T_i in order. If the current node is a leaf node, check whether its part of speech is noun, verb, or adjective; if so, add the node to the candidate keyword set, otherwise skip to the next node. If the current node is not a leaf node, check whether it is a noun phrase; if it is a noun phrase with a non-empty right subtree, keep recursing into the right subtree of the current node until the subtree contains no non-leaf node whose parent is a noun phrase, and then add the child nodes of that noun phrase to the candidate keyword set as a whole;
Step4: further filter the candidate keyword set with the extracted general words: if an element of the candidate keyword set contains a general word, remove that element from the set.
Further, the position feature weight is calculated as follows: for the body text under each level of heading in a 3GPP technical standard, the sentences are divided at punctuation and numbered consecutively from 1 within each sentence set. Denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the set of special positions as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.
locate(ck_i) denotes the position where candidate keyword ck_i appears, and the feature function Pos(ck_i), given by formula (3), denotes the weight assigned to ck_i along the position dimension, where Sno_{ck_i} is the number of the sentence containing ck_i, Snu_{ck_i} is the number of sentences in the paragraph containing ck_i, and len(ck_i) is the number of words in ck_i. The weights of occurrences at different positions are averaged; denoting the average position weight by W(Pos(ck_i)),

W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ_{k=1}^{fre(ck_i)} Pos_k(ck_i),

where fre(ck_i) is the frequency with which ck_i occurs in the same technical standard and Pos_k(ck_i) is the position weight of its k-th occurrence.
Further, the word co-occurrence feature weight is calculated as follows:
Denote the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_n)}. For any candidate keyword ck_i in technical standard d_i, denote its constituent words as cw_1, cw_2, …, cw_m, where m is the number of words ck_i contains, and denote the co-occurrence word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_p}, where p is the size of the co-occurrence word set, each wco_j denotes one of the co-occurrence words of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i} with 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i is then expressed by formula (5), where fre(wco_j) is the frequency of co-occurrence word wco_j and len(wco_j) is the number of words it contains. When ck_i contains multiple words, its weight along the word co-occurrence dimension is computed by formula (6).
Further, the contextual semantic feature weight is calculated as follows:
The task is decomposed into independently predicting, from the current word w, the maximum probability of each word constituting its context Context(w); the objective function is

L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w),

where c_i ∈ Context(w), D is the technical standard corpus, and θ are the model parameters. The conditional probability P(c_i | w) is expressed as

P(c_i | w) = exp(v_{c_i} · v_w) / Σ_{c'} exp(v_{c'} · v_w),

where v_{c_i} and v_w are the vector representations of c_i and w, c' ranges over all distinct words in the corpus, and v_{c'} is the vector representation of c'. Each technical standard d_i in the technical standard set D is viewed as a sequence of words w_1 … w_i … w_n assumed to be mutually independent. For each candidate keyword ck_i in d_i: if it is a single word, the prediction probability is computed as

P(w_1 … w_i … w_n | ck_i) = Π_{j=1}^{n} P(w_j | ck_i);

if it is a phrase, the probability is computed by formula (9). Taking the logarithm of both sides, log P(w_1 … w_i … w_n | ck_i) is used as the weight of candidate keyword ck_i along the semantic dimension, denoted W(Sem(ck_i)), and log P(w_1 … w_i … w_n | ck_i) is approximated by log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i). Then

W(Sem(ck_i)) = Σ_{c_j ∈ Context(ck_i)} log P(c_j | ck_i).
Further, step 4) includes:
For any candidate keyword ck_i in technical standard d_i, jointly consider the position feature, word co-occurrence feature, and contextual semantic feature; the weighted score of ck_i over the three feature dimensions is

W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i)).

Denote the scores of the candidate keywords of d_i as Score(d_i) = {W(ck_1), …, W(ck_i), …, W(ck_n)}, sort the scores in Score(d_i) from high to low, and set the dynamic threshold λ to the average of all scores:

λ = (1/n) · Σ_{i=1}^{n} W(ck_i).

If a candidate keyword of d_i satisfies W(ck_i) ≥ λ, ck_i is added to the result keyword set.
The automatic keyword extraction method provided by the invention fuses the position feature, word co-occurrence feature, and contextual semantic feature to extract keywords, jointly accounting for the influence of in-document position and contextual semantics on keyword weights. It achieves high precision and recall, improves the retrieval quality of 3GPP technical standards, reduces labor cost, and meets the needs of practical applications well.
Description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 is a dependency parse tree;
Fig. 3 compares the frameworks of the CBOW model and the Skip-gram model.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific examples described here are only used to explain the present invention, not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
The automatic keyword extraction method proposed by the present invention first extracts the general words in 3GPP technical standards with a method based on word frequency-document distribution entropy, then extracts candidate keywords with an algorithm based on the dependency parse tree. After filtering the general words out of the candidate keywords, it combines the position feature, word co-occurrence feature, and contextual semantic feature to compute candidate keyword weighted scores, and finally computes a dynamic threshold from the score range of each technical standard and determines the result keywords with the dynamic threshold. Specifically, as shown in Fig. 1, the automatic keyword extraction method includes the following steps:
Step 1) preprocess the 3GPP technical standards, mainly by parsing them with Apache POI and removing text noise such as pictures, tables, formulas, special characters, and illegal characters;
Step 2) extract the general words in all technical standards based on word frequency-document distribution entropy;
Step 3) split each technical standard into a sentence set, run dependency parsing on each sentence, extract candidate keywords based on the dependency parse trees, and filter out general words;
Step 4) jointly consider the position feature, word co-occurrence feature, and contextual semantic feature of each candidate keyword within the document, compute and rank weighted scores, then compute a dynamic threshold from the actual score range of the technical standard and add the candidate keywords whose scores exceed the threshold to the result keyword set.
3GPP technical standards contain not only simple stop words such as "if", "at", "not", and "or", but also general words that run through most technical standards, such as "Figure", "version", "general", and "given", which are peculiar to the technical standards and carry no representativeness or importance. Observation shows that both simple stop words and the general words peculiar to technical standards appear, with different frequencies, in technical standards of different versions and types; they circulate widely and generally cannot summarize or abstract the content of a specific technical standard. These words are collectively called general words.
Clearly, coverage would not be comprehensive enough if only a manually collected stop word list were chosen. Therefore, to reduce the interference of general words with the keyword extraction task as much as possible, the concept of word frequency-document distribution entropy is introduced, following the principle of information entropy, to acquire technical standard general words automatically. Information entropy was first introduced into information theory by Shannon to measure the uncertainty of a discrete random variable: the larger the entropy, the greater the uncertainty of the corresponding random variable. Analogously, the word w is regarded as a random variable, and word frequency-document distribution entropy is defined as follows.
Definition 1: word frequency-document distribution entropy measures the uncertainty of the distribution of a word w over the set of technical standards.
Let the document set of n technical standards be D = {d_1, d_2, …, d_n} and denote the word frequency-document distribution entropy of word w by H(w). Then H(w) is computed as in formula (1):

H(w) = -Σ_{i=1}^{n} P(w, d_i) · log P(w, d_i),   (1)

where P(w, d_i) is the probability that word w appears in technical standard d_i, 1 ≤ i ≤ n. By maximum likelihood estimation, P(w, d_i) is computed by formula (2):

P(w, d_i) = f(w, d_i) / Σ_{j=1}^{n} f(w, d_j),   (2)

where f(w, d_i) is the number of occurrences of w in technical standard d_i. It can be seen that the more technical standards contain w and the more uniformly w is distributed over the set, the larger the word frequency-document distribution entropy H(w), indicating greater uncertainty of w's distribution over the technical standard set D; such a w is more likely to be a general word of no importance in the set.
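For illustration, the following Python sketch computes formulas (1) and (2) over a toy tokenized document set. The tokenization and the cut-off value are illustrative assumptions only; the experiments below report an optimal threshold of 5.42 over 8,000 standards.

```python
import math

def distribution_entropy(word, docs):
    """H(w) per formulas (1)-(2): P(w, d_i) = f(w, d_i) / sum_j f(w, d_j),
    H(w) = -sum_i P(w, d_i) * log P(w, d_i)."""
    freqs = [doc.count(word) for doc in docs]  # f(w, d_i) per document
    total = sum(freqs)
    if total == 0:
        return 0.0
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)

# Toy corpus: three tokenized "technical standards".
docs = [["the", "present", "version", "shall", "apply"],
        ["this", "version", "defines", "the", "logical", "channel"],
        ["version", "history", "and", "scope"]]

# Words distributed evenly across many documents score high and are
# treated as general words; the 1.0 cut-off is illustrative only.
general_words = {w for doc in docs for w in set(doc)
                 if distribution_entropy(w, docs) > 1.0}
print(sorted(general_words))  # ['version']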
Statistics show that most keywords are content words or phrases such as nouns, verbs, and adjectives, and generally do not contain stop words without concrete meaning or the general words evenly distributed across technical standards. The keyword scope is therefore restricted to the verbs, adjectives, nouns, and noun phrases remaining after general words are removed. To extract candidate keywords with semantic coherence and complete syntactic modification, dependency parsing is first applied to the sentences in the 3GPP technical standards; then, combining the parse trees, the noun phrases that satisfy modification continuity, together with the verbs, adjectives, and nouns, are added to the candidate keyword set. For noun phrases, the minimum-granularity NP in the parse tree is taken as the candidate keyword. Finally, general word filtering is applied to the candidate keyword set. For example, parsing the sentence "logical channels are SAPs between MAC and RLC" yields the result shown in Fig. 2.
As can be seen from Fig. 2, the adjective "logical" modifies the noun "channels", and "logical" and "channels" together form a noun phrase (NP); "SAPs" and "MAC and RLC" are noun phrases (NP), and "SAPs between MAC and RLC" as a whole is again an NP. In the parse tree, however, the prepositional phrase PP formed by "MAC and RLC" with "between", and the noun phrase "SAPs", are both child nodes of that NP and are sibling nodes of each other. Clearly, taking "MAC and RLC" as a noun phrase gives a smaller granularity than taking "SAPs between MAC and RLC". Therefore, in the example sentence, "logical", "channels", "logical channels", "are", "SAPs", "MAC", "RLC", and "MAC and RLC" are chosen as candidate keywords, and the candidates are then filtered with the extracted general words. Based on the above analysis, the candidate keyword extraction algorithm based on the dependency parse tree proceeds as follows:
Step1: traverse the technical standard set D; split each technical standard d_i in D into sentences at punctuation and denote the resulting sentence set as Sentences(d_i) = {s_1, s_2, …, s_{n_s}}, where n_s is the number of sentences in document d_i.
Step2: run dependency parsing on each sentence in Sentences(d_i) with the Stanford Parser to obtain the corresponding set of parse trees Trees(d_i) = {T_1, T_2, …, T_{n_s}}, where T_i denotes the parse tree of the i-th sentence of technical standard d_i.
Step3: read the parse tree set Trees(d_i) in a loop. For any parse tree T_i ∈ Trees(d_i), treat each word together with its part of speech as one leaf node and traverse T_i in order. If the current node is a leaf node (and not the last leaf node), check whether its part of speech is noun, verb, or adjective; if so, add the node to the candidate keyword set, otherwise skip to the next node. If the current node is not a leaf node, check whether it is a noun phrase (NP); if it is an NP with a non-empty right subtree, keep recursing into the right subtree of the current node until the subtree contains no non-leaf node whose parent is an NP, and then add the child nodes of that NP to the candidate keyword set as a whole.
Step4: since the candidates extracted by the previous step may still contain technical standard general words, the candidate keyword set is further filtered with the extracted general words: if an element of the candidate keyword set contains a general word, remove that element from the set.
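The following sketch mirrors Step1 to Step4 on the Fig. 2 example sentence. It is an approximation under stated assumptions: it uses nltk.Tree over a bracketed Stanford-style parse whose non-leaf nodes carry phrase labels such as NP and PP, and a "minimal NP" test (an NP with no NP child) stands in for the right-subtree recursion of Step3; the POS tags and the general word set are illustrative.

```python
from nltk.tree import Tree

CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives

def extract_candidates(tree, general_words):
    """Collect content-word leaves plus minimal-granularity NPs, then
    filter candidates that contain a general word (Step4)."""
    candidates = []

    def visit(node):
        if not isinstance(node, Tree):
            return
        has_np_child = any(isinstance(c, Tree) and c.label() == "NP"
                           for c in node)
        if node.label() == "NP" and not has_np_child:
            # smallest-granularity NP: take its leaves as one candidate
            candidates.append(" ".join(node.leaves()))
        for child in node:
            visit(child)

    visit(tree)
    for word, tag in tree.pos():  # single words with their POS tags
        if tag.startswith(CONTENT_TAG_PREFIXES):
            candidates.append(word)
    # Step4: drop candidates containing a general word; keep first occurrences
    kept = [c for c in candidates
            if not any(w.lower() in general_words for w in c.split())]
    return list(dict.fromkeys(kept))

# Bracketed Stanford-style parse of the Fig. 2 example sentence.
parse = Tree.fromstring(
    "(S (NP (JJ logical) (NNS channels)) (VP (VBP are)"
    " (NP (NP (NNS SAPs)) (PP (IN between)"
    " (NP (NNP MAC) (CC and) (NNP RLC))))))")
print(extract_candidates(parse, general_words={"are"}))
# -> ['logical channels', 'SAPs', 'MAC and RLC', 'logical', 'channels', 'MAC', 'RLC']
```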
Analysis of the structure of 3GPP technical standards shows that, besides the body text, the Scope, Reference, and Definitions and Abbreviations sections have important reference value for the whole document and should be treated as emphasis positions. Each chapter of the body usually unfolds around its nearest heading, so a heading can be regarded as a condensation of the core content of its paragraph, and candidate keywords occurring at that position should be given higher weight. Similarly, content appearing in section remarks (NOTE) generally emphasizes or supplements the text and should also be treated as a special position.
The position where a candidate keyword occurs in a 3GPP technical standard is therefore taken as one factor influencing its weight. For the body text under each level of heading in a 3GPP technical standard, sentences are divided at punctuation and numbered consecutively from 1 within each sentence set; the smaller the number of the sentence containing a candidate keyword, the closer it is to the heading and the more likely it is a keyword tied to the theme. Denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the set of special positions as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.
locate(ck_i) denotes the position where candidate keyword ck_i appears, and the feature function Pos(ck_i) denotes the weight assigned to ck_i along the position dimension; Pos(ck_i) is expressed as shown in formula (3), where Sno_{ck_i} is the number of the sentence containing ck_i, Snu_{ck_i} is the number of sentences in the paragraph containing ck_i, and len(ck_i) is the number of words in ck_i. Adding len(ck_i) to the denominator avoids a position weight of 0. Since candidate keyword ck_i may occur several times at different positions in technical standard d_i, the weights of occurrences at different positions are averaged; denoting the average position weight by W(Pos(ck_i)), it is computed as in formula (4):

W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ_{k=1}^{fre(ck_i)} Pos_k(ck_i),   (4)

where fre(ck_i) is the frequency with which ck_i occurs in the same technical standard and Pos_k(ck_i) is the position weight of its k-th occurrence. This averaging enhances the weight of a candidate keyword ck_i that occurs rarely but appears at a special position, weakening the bias introduced by computing candidate keyword weights from the frequency feature alone.
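Formula (3) is rendered as an image in the published text. The sketch below therefore implements an assumed piecewise form that matches the surrounding description (a fixed weight of 1 in special positions; otherwise a weight that falls as the sentence number grows, kept above zero by the len(ck) term) together with the averaging of formula (4); the exact published formula may differ in detail.

```python
SPECIAL_POSITIONS = {"Title", "Scope", "Reference",
                     "Definitions", "Abbreviations", "NOTE"}

def pos_score(location, sno, snu, length):
    """Weight of one occurrence. ASSUMED form of formula (3): special
    positions get weight 1; in body text the weight decreases with the
    sentence number sno, and the len(ck) term keeps it above zero."""
    if location in SPECIAL_POSITIONS:
        return 1.0
    return (snu - sno + length) / (snu + length)

def position_weight(occurrences, length):
    """Formula (4): average over the fre(ck) occurrences of ck."""
    return sum(pos_score(loc, sno, snu, length)
               for loc, sno, snu in occurrences) / len(occurrences)

# "MCH" (len = 1) seen once in Scope and once as sentence 7 of a
# 9-sentence body paragraph (illustrative values).
print(position_weight([("Scope", 1, 1), ("Body", 7, 9)], length=1))  # 0.65
```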
Word co-occurrence is a very important factor in keyword extraction. Observing the extracted 3GPP technical standard candidate keywords reveals that the constituent words of one candidate repeatedly appear inside candidates of other lengths. For example, among the three candidate keywords "MCH", "MCH transmission", and "MCH subframe allocation", the word "MCH" appears inside the other two candidates of different lengths, so "MCH transmission" and "MCH subframe allocation" can be regarded as co-occurrence words of "MCH"; such co-occurrence words often express more concrete information than the single constituent word. Therefore, if a constituent word of a candidate keyword has more co-occurrence words, it can be considered to radiate richer meaning and should be given higher weight. Accordingly, the frequency and word length of the co-occurrence words of a candidate keyword's constituent words are used as the word co-occurrence feature for computing the candidate keyword weight.
Denote the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_n)}. For any candidate keyword ck_i in technical standard d_i, denote its constituent words as cw_1, cw_2, …, cw_m, where m is the number of words ck_i contains; denote the co-occurrence word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_p}, where p is the size of the co-occurrence word set (the number of co-occurrence words in it), each wco_j denotes one of the co-occurrence words of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i} with 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i can then be expressed by formula (5), where fre(wco_j) is the frequency of co-occurrence word wco_j and len(wco_j) is the number of words it contains. When candidate keyword ck_i contains multiple words, its weight along the co-occurrence dimension is computed as shown in formula (6).
It can be seen that when each constituent word of candidate keyword ck_i frequently appears in many co-occurrence words, each constituent word contributes more to ck_i, so the weight of ck_i along the word co-occurrence dimension is larger.
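Formulas (5) and (6) are likewise images in the published text. The sketch below assumes the contribution of a constituent word is Σ fre(wco)·len(wco) over its co-occurrence words and that formula (6) averages the m constituent contributions; both forms are assumptions consistent with the description, not the verbatim formulas.

```python
def cooccurrence_weight(candidate, candidate_set, freq):
    """ASSUMED forms of formulas (5)-(6): each constituent word cw of ck
    contributes sum(fre(wco) * len(wco)) over the candidates wco that
    contain cw, and the weight of ck averages its m contributions."""
    words = candidate.split()
    contributions = []
    for cw in words:
        contrib = sum(freq[wco] * len(wco.split())
                      for wco in candidate_set
                      if wco != candidate and cw in wco.split())
        contributions.append(contrib)
    return sum(contributions) / len(words)

cands = {"MCH", "MCH transmission", "MCH subframe allocation"}
freq = {"MCH": 5, "MCH transmission": 2, "MCH subframe allocation": 1}
print(cooccurrence_weight("MCH", cands, freq))  # 2*2 + 1*3 = 7.0
```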
Keywords are generally a highly condensed form of the core content of a technical standard and often embody its theme from different semantic levels in a concentrated way. The influence of candidate keywords' contextual semantics on their weight therefore cannot be ignored either. Since word vectors characterize semantic features well, Word2vec is introduced to compute the weight of each candidate keyword along the semantic feature dimension.
Word2vec is a tool released by Google, based on deep learning ideas, that addresses problems such as the lack of generalization power and the curse of dimensionality in statistical language model computation. Word2vec includes two training models, CBOW and Skip-gram, and, to reduce the complexity of model solving, provides two classes of training optimizations, Hierarchical Softmax (HS) and Negative Sampling (NS); a training framework is composed of a training model and an optimization method. As shown in Fig. 3, the frameworks built from the two models both comprise an input layer, a projection layer, and an output layer; they differ in that the CBOW-based framework predicts the current word w from the contextual semantic environment in which the vocabulary occurs, while the Skip-gram-based framework predicts the contextual semantic information from the current word w.
To predict the context Context(w) (with window c) from the current word w, the Skip-gram model decomposes the task into independently predicting, from w, the maximum probability of each word constituting Context(w); the objective function is

L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w),

where c_i ∈ Context(w), D is the technical standard corpus, and θ are the model parameters. The conditional probability P(c_i | w) is normalized by softmax, as shown in formula (7):

P(c_i | w) = exp(v_{c_i} · v_w) / Σ_{c'} exp(v_{c'} · v_w),   (7)

where v_{c_i} and v_w are the vector representations of c_i and w, and c' ranges over all distinct words in the corpus; since their number is large, Hierarchical Softmax or Negative Sampling optimization can be used, and v_{c'} is the vector representation of c'. Each technical standard d_i in the technical standard set D is viewed as a sequence of words w_1 … w_i … w_n assumed to be mutually independent. For each candidate keyword ck_i in d_i: if it is a single word, the prediction probability is computed with formula (8),

P(w_1 … w_i … w_n | ck_i) = Π_{j=1}^{n} P(w_j | ck_i),   (8)

and if it is a phrase, formula (9) is used instead, where each P(w_j | ck_i) is computed from formula (7) by variable substitution. Evidently, the larger the prediction probability P(w_1 … w_i … w_n | ck_i), the better candidate keyword ck_i predicts the contextual information and the more likely it is a keyword characterizing the whole document. To avoid the extremely subtle errors that arise when the conditional probabilities in the product become too small, the logarithm of both sides is taken, and log P(w_1 … w_i … w_n | ck_i) is used as the weight of candidate keyword ck_i along the semantic dimension, denoted W(Sem(ck_i)). Considering that Word2vec training already establishes associations between similar words, log P(w_1 … w_i … w_n | ck_i) is approximated, to simplify computation, by log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i). W(Sem(ck_i)) is then computed as shown in formula (10):

W(Sem(ck_i)) = Σ_{c_j ∈ Context(ck_i)} log P(c_j | ck_i).   (10)
For any candidate keyword ck_i in technical standard d_i, the position feature, word co-occurrence feature, and contextual semantic feature are jointly considered, and the weighted score of ck_i over the three feature dimensions is computed with formula (11):

W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i)).   (11)

Fusing three different features avoids the impact on keyword extraction of any single insufficient feature factor. Denote the scores of the candidate keywords of d_i as Score(d_i) = {W(ck_1), …, W(ck_i), …, W(ck_n)}, sort the scores in Score(d_i) from high to low, and set the dynamic threshold λ to the average of all scores, computed as in formula (12):

λ = (1/n) · Σ_{i=1}^{n} W(ck_i).   (12)

If a candidate keyword of d_i satisfies W(ck_i) ≥ λ, ck_i is added to the result keyword set. A fixed threshold is not chosen because different technical standards differ in length, and the candidate keyword score ranges computed for different technical standards also differ; the dynamic threshold is therefore set from the actual score range of each single technical standard.
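Fusing the three weights and applying the dynamic threshold of formulas (11) and (12) then reduces to a few lines; the per-feature scores below are illustrative values.

```python
def select_keywords(feature_scores):
    """Formulas (11)-(12): W(ck) = W(Pos) + W(Coo) + W(Sem); the dynamic
    threshold lambda is the mean score over the document's own candidates,
    and candidates scoring >= lambda become result keywords."""
    fused = {ck: sum(parts) for ck, parts in feature_scores.items()}
    lam = sum(fused.values()) / len(fused)
    return sorted((ck for ck, w in fused.items() if w >= lam),
                  key=lambda ck: -fused[ck])

# (position, co-occurrence, semantic) weights per candidate, illustrative
feature_scores = {"MCH": (0.65, 7.0, -3.2),
                  "logical channels": (0.40, 2.0, -2.8),
                  "version": (0.10, 0.5, -9.0)}
print(select_keywords(feature_scores))  # ['MCH', 'logical channels']
```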
The method of the invention was tested experimentally. The experimental data were selected from the 2016 technical standards (including technical specifications and technical reports) on the 3GPP website; after deduplication and denoising, 8,000 documents were obtained in total. The valid series numbers of the technical standards range over 01-12, 21-38, 41-46, 48-52, and 55, 42 series in total; each series includes multiple versions, the total size is 14 GB, and each technical standard consists of the Cover, Foreword, Scope, References, Definitions and Abbreviations, topic body, and Annex parts.
The experiments evaluate keyword extraction with three indicators common in natural language processing tasks: precision (P), recall (R), and F-score, computed as in formulas (13) to (15):

P = (number of correctly extracted keywords) / (number of extracted keywords),   (13)
R = (number of correctly extracted keywords) / (number of reference keywords),   (14)
F = 2PR / (P + R).   (15)
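A minimal sketch of formulas (13) to (15); simple lowercasing stands in for the lemmatization and abbreviation/full-form matching used in the experiments.

```python
def prf(extracted, reference):
    """Precision, recall, and F-score per formulas (13)-(15)."""
    e = {k.lower() for k in extracted}
    r = {k.lower() for k in reference}
    correct = len(e & r)
    p = correct / len(e) if e else 0.0
    rec = correct / len(r) if r else 0.0
    f = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f

print(prf(["MCH", "logical channels", "version"], ["MCH", "RLC"]))
# -> (0.333..., 0.5, 0.4)
```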
The preprocessed technical standards were used to extract technical standard general words with the word frequency-document distribution entropy method. Repeated experiments showed that the optimal word frequency-document distribution entropy threshold is 5.42; words whose entropy exceeds the threshold are chosen as technical standard general words, yielding 13,566 general words in total. Part of the general word extraction result is shown in Table 2.
Table 2. Partial general word extraction results

No.  General word   H(w)      No.  General word   H(w)
1    version        10.9665   11   all            9.9539
2    should         10.8165   12   possible       9.8908
3    latest         10.7022   13   foreword       9.8543
4    approve        10.6394   14   through        9.8097
5    specification  10.5639   15   modify         9.7739
6    update         10.4934   16   restriction    9.6978
7    present        10.2963   17   this           9.6536
8    within         10.1056   18   available      9.6281
9    be             10.0572   19   release        9.5941
10   further        10.0188   20   when           9.5148
As Table 2 shows, the algorithm based on word frequency-document distribution entropy extracts not only common stop words such as "all", "this", and "when", but also the general words in technical standards, such as "version", "specification", and "release". Most technical standard general words can be obtained effectively with this method.
After the candidate keyword set of each technical standard is filtered with the above general word list, the weights corresponding to the position feature, word co-occurrence feature, and contextual semantic feature are computed separately. For the contextual semantic feature, the experiments train on the 14 GB of technical standards with the Skip-gram model in Word2vec and the hierarchical (Huffman) softmax optimization; the context window is set to 10 and the vector dimension to 200, yielding a 965.1 MB model file after 10 iterations. To analyze the influence of different features on technical standard keyword extraction, the feature combinations set up for the experiments are shown in Table 3.
Table 3. Feature combinations
The candidate keyword scores of each technical standard are computed under the different feature combinations with formulas (3) to (11), and the dynamic threshold computed with formula (12) filters out the qualifying candidates as the identified keywords. From the 8,000 technical standards, 1,000 covering different series and versions were randomly selected, and three annotators cross-annotated them, taking the intersection to screen 2, 4, 6, 8, and 10 keywords from each technical standard as the reference keyword sets. The identified keywords are lemmatized and compared with the manually annotated reference keywords; an identified keyword counts as correct if it matches the form of an annotated keyword or if the two are abbreviation and full form of each other. Precision, recall, and F-score of keyword identification are counted for the different feature combinations and keyword counts; the experimental results are shown in Table 4.
Table 4. Keyword extraction results under different feature combinations
As Table 4 shows, when the number of keywords is 2, the feature combinations numbered Feature1, Feature4, Feature5, and Feature7 identify keywords with higher recall than the other combinations. This is because, when few keywords are taken, the candidate keywords appearing at special positions are more likely to be correctly identified as keywords; meanwhile, the words at these special positions provide little contextual semantic information, so the position feature of keyword occurrence in the technical standard is the relatively dominant factor. As the keyword count increases from 2, comparing Feature1 to Feature3 shows that the recall of Feature1 grows slowly and tends to decline gradually; the precision and recall of Feature2 rise clearly for keyword counts of 4 to 8, after which precision declines; and when the keyword count exceeds 6, the recall of Feature3 rises by an increasing margin. This indicates that, as the number of keywords increases, the influence of position on keyword weight gradually decreases while the influence of the word co-occurrence and contextual semantic features gradually increases. Meanwhile, comparing Feature5 and Feature7 shows that adding the word co-occurrence feature raises precision and recall. This is because the co-occurrence factor helps identify more phrase-type keywords, and these phrase-type keywords are likely to correspond to abbreviation keywords that summarize meaning but hold no position advantage; as the keyword count increases, the keywords identified through the co-occurrence feature are more likely to be included in the reference keyword set. Comparing Feature4 and Feature7 shows that, once the contextual semantic feature is added, recall improves markedly from a keyword count of 4 onward, because as the keyword quantity grows, the candidate keywords that carry rich contextual semantic information are also more likely to be chosen as keywords. For the same keyword count, comparing Feature1, Feature2, Feature3, and Feature7 shows that Feature7, thanks to the advantage of combining different features, achieves better recognition than any single feature.
The automatic keyword extraction method provided by the invention fuses the position feature, word co-occurrence feature, and contextual semantic feature to extract keywords, jointly accounting for the influence of in-document position and contextual semantics on keyword weights. It achieves high precision and recall, improves the retrieval quality of 3GPP technical standards, reduces labor cost, and meets the needs of practical applications well.
The above embodiments only express implementations of the present invention, and their description is relatively specific and detailed, but they cannot therefore be construed as limiting the patent scope of the invention. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the invention, and these fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An automatic keyword extraction method, characterized by including: extracting general words, extracting candidate keywords, filtering the general words out of the candidate keywords, computing candidate keyword weighted scores by combining a position feature, a word co-occurrence feature, and a contextual semantic feature, computing a dynamic threshold from the range of candidate keyword weighted scores, and determining the result keywords with the dynamic threshold.
2. The automatic keyword extraction method according to claim 1, characterized in that the method includes:
Step 1) remove the text noise in the 3GPP technical standards;
Step 2) extract the general words in the technical standards;
Step 3) extract candidate keywords based on the parse tree and filter out general words;
Step 4) jointly consider the position feature, word co-occurrence feature, and contextual semantic feature of each candidate keyword within the document, compute and rank weighted scores, then compute a dynamic threshold from the actual score range of the technical standard and add the candidate keywords whose scores exceed the threshold to the result keyword set.
3. The automatic keyword extraction method according to claim 1, characterized in that step 1) specifically is: parsing the technical standards with Apache POI to remove the text noise in the 3GPP technical standards.
4. The automatic keyword extraction method according to any one of claims 1 to 3, characterized in that step 2) includes: extracting the general words in the technical standards based on word frequency-document distribution entropy, which measures the uncertainty of the distribution of a word w over the set of technical standards; letting the document set of n technical standards be D = {d_1, d_2, …, d_n} and denoting the word frequency-document distribution entropy of word w by H(w),

H(w) = -Σ_{i=1}^{n} P(w, d_i) · log P(w, d_i),

where P(w, d_i) is the probability that word w appears in technical standard d_i, 1 ≤ i ≤ n; by maximum likelihood estimation,

P(w, d_i) = f(w, d_i) / Σ_{j=1}^{n} f(w, d_j),

where f(w, d_i) is the number of occurrences of word w in technical standard d_i.
5. The automatic keyword extraction method according to any one of claims 1 to 4, characterized in that extracting the candidate keywords based on the dependency parse tree includes:
Step1: traverse the technical standard set D; split each technical standard d_i in D into sentences at punctuation and denote the resulting sentence set as Sentences(d_i) = {s_1, s_2, …, s_{n_s}}, where n_s is the number of sentences in document d_i;
Step2: run dependency parsing on each sentence in Sentences(d_i) with the Stanford Parser to obtain the corresponding set of parse trees Trees(d_i) = {T_1, T_2, …, T_{n_s}}, where T_i denotes the parse tree of the i-th sentence of technical standard d_i;
Step3: read the parse tree set Trees(d_i) in a loop; for any parse tree T_i ∈ Trees(d_i), treat each word together with its part of speech as one leaf node and traverse T_i in order; if the current node is a leaf node, check whether its part of speech is noun, verb, or adjective, add the node to the candidate keyword set if so, and otherwise skip to the next node; if the current node is not a leaf node, check whether it is a noun phrase, and if it is a noun phrase with a non-empty right subtree, keep recursing into the right subtree of the current node until the subtree contains no non-leaf node whose parent is a noun phrase, then add the child nodes of that noun phrase to the candidate keyword set as a whole;
Step4: further filter the candidate keyword set with the extracted general words: if an element of the candidate keyword set contains a general word, remove that element from the set.
6. The automatic keyword extraction method according to any one of claims 1 to 5, characterized in that the calculation of the position feature weight includes: for the body text under each level of heading in a 3GPP technical standard, dividing the sentences at punctuation and numbering the sentences in each sentence set consecutively from 1; denoting the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords; denoting the set of special positions as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE},
with locate(ck_i) denoting the position where candidate keyword ck_i appears and the feature function Pos(ck_i), given by formula (3), denoting the weight assigned to ck_i along the position dimension, where Sno_{ck_i} is the number of the sentence containing ck_i, Snu_{ck_i} is the number of sentences in the paragraph containing ck_i, and len(ck_i) is the number of words in ck_i; and averaging the weights of occurrences at different positions, the average position weight being

W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ_{k=1}^{fre(ck_i)} Pos_k(ck_i),

where fre(ck_i) is the frequency with which ck_i occurs in the same technical standard.
7. The automatic keyword extraction method according to any one of claims 1 to 6, characterized in that the word co-occurrence feature weight is calculated as follows:
denoting the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_n)}, for any candidate keyword ck_i in technical standard d_i, denoting its constituent words as cw_1, cw_2, …, cw_m, where m is the number of words ck_i contains, and denoting the co-occurrence word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_p}, where p is the size of the co-occurrence word set, wco_j denotes one of the co-occurrence words of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i} with 1 ≤ j ≤ p; the contribution of cw_i to candidate keyword ck_i is expressed by formula (5), where fre(wco_j) is the frequency of co-occurrence word wco_j and len(wco_j) is the number of words it contains; and when candidate keyword ck_i contains multiple words, its weight along the co-occurrence dimension is computed by formula (6).
8. The automatic keyword extraction method according to any one of claims 1 to 7, characterized in that the contextual semantic feature weight is calculated as follows:
the task is decomposed into independently predicting, from the current word w, the maximum probability of each word constituting its context Context(w), with objective function

L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w),

where c_i ∈ Context(w), D is the technical standard corpus, and θ are the model parameters; the conditional probability P(c_i | w) is expressed as

P(c_i | w) = exp(v_{c_i} · v_w) / Σ_{c'} exp(v_{c'} · v_w),

where v_{c_i} and v_w are the vector representations of c_i and w, c' ranges over all distinct words in the corpus, and v_{c'} is the vector representation of c'; each technical standard d_i in the technical standard set D is viewed as a sequence of words w_1 … w_i … w_n assumed to be mutually independent; for each candidate keyword ck_i in d_i, if it is a single word, the prediction probability is computed as

P(w_1 … w_i … w_n | ck_i) = Π_{j=1}^{n} P(w_j | ck_i),

and if it is a phrase, the probability is computed by formula (9); taking the logarithm of both sides, log P(w_1 … w_i … w_n | ck_i) is used as the weight of candidate keyword ck_i along the semantic dimension, denoted W(Sem(ck_i)), and is approximated by log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i); then

W(Sem(ck_i)) = Σ_{c_j ∈ Context(ck_i)} log P(c_j | ck_i).
9. The automatic keyword extraction method according to any one of claims 1 to 8, characterized in that step 4) includes:
for any candidate keyword ck_i in technical standard d_i, jointly considering the position feature, word co-occurrence feature, and contextual semantic feature, the weighted score of ck_i over the three feature dimensions being

W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i));

denoting the scores of the candidate keywords of d_i as Score(d_i) = {W(ck_1), …, W(ck_i), …, W(ck_n)}, sorting the scores in Score(d_i) from high to low, and setting the dynamic threshold λ to the average of all scores,

λ = (1/n) · Σ_{i=1}^{n} W(ck_i);

if a candidate keyword of d_i satisfies W(ck_i) ≥ λ, ck_i is added to the result keyword set.
10. The automatic keyword extraction method according to any one of claims 1 to 9, characterized in that the text noise includes pictures, tables, formulas, special characters, and illegal characters.
CN201810611476.7A 2018-06-13 2018-06-13 Automatic keyword extraction method Active CN108920456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Publications (2)

Publication Number Publication Date
CN108920456A (en) 2018-11-30
CN108920456B CN108920456B (en) 2022-08-30

Family

ID=64419617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810611476.7A Active CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Country Status (1)

Country Link
CN (1) CN108920456B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110004465A1 (en) * 2009-07-02 2011-01-06 Battelle Memorial Institute Computation and Analysis of Significant Themes
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU Yuncheng et al., "Automatic Keyword Extraction Based on Character Co-occurrence Frequency", Journal of Beijing Information Science and Technology University *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614626A * 2018-12-21 2019-04-12 北京信息科技大学 Automatic keyword extraction method based on gravitational model
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN109960724A * 2019-03-13 2019-07-02 北京工业大学 Text summarization method based on TF-IDF
CN110134767A * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 Vocabulary screening method
CN110147425B (en) * 2019-05-22 2021-04-06 华泰期货有限公司 Keyword extraction method and device, computer equipment and storage medium
CN110147425A * 2019-05-22 2019-08-20 华泰期货有限公司 Keyword extraction method and device, computer equipment and storage medium
CN110377724A * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 Automatic corpus keyword extraction algorithm based on data mining
CN111552786A (en) * 2020-04-16 2020-08-18 重庆大学 Question-answering working method based on keyword extraction
CN111597793A (en) * 2020-04-20 2020-08-28 中山大学 Paper innovation measuring method based on SAO-ADV structure
CN111597793B (en) * 2020-04-20 2023-06-16 中山大学 Paper innovation measuring method based on SAO-ADV structure
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network
CN111985217B (en) * 2020-09-09 2022-08-02 吉林大学 Keyword extraction method, computing device and readable storage medium
CN111985217A (en) * 2020-09-09 2020-11-24 吉林大学 Keyword extraction method and computing device
CN112988951A (en) * 2021-03-16 2021-06-18 福州数据技术研究院有限公司 Scientific research project review expert accurate recommendation method and storage device
CN113191145A (en) * 2021-05-21 2021-07-30 百度在线网络技术(北京)有限公司 Keyword processing method and device, electronic equipment and medium
CN113191145B (en) * 2021-05-21 2023-08-11 百度在线网络技术(北京)有限公司 Keyword processing method and device, electronic equipment and medium
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device
CN113971216A (en) * 2021-10-22 2022-01-25 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and memory
CN114492433A (en) * 2022-01-27 2022-05-13 南京烽火星空通信发展有限公司 Method for automatically selecting proper keyword combination to extract text

Also Published As

Publication number Publication date
CN108920456B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN108920456A Automatic keyword extraction method
KR101536520B1 (en) Method and server for extracting topic and evaluating compatibility of the extracted topic
US10437867B2 (en) Scenario generating apparatus and computer program therefor
Beeferman et al. Statistical models for text segmentation
US10095685B2 (en) Phrase pair collecting apparatus and computer program therefor
CN109670039A Semi-supervised e-commerce review sentiment analysis method based on tripartite graphs and clustering
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
BR112012011091B1 Method and apparatus for extracting and evaluating word quality
US10387805B2 (en) System and method for ranking news feeds
CN112989802B Bullet screen keyword extraction method, device, equipment and medium
CN107688630B Semantic-based weakly supervised microblog multi-emotion dictionary expansion method
CN104915443B Extraction method for Chinese microblog evaluation objects
CN105488098B New word extraction method based on domain differences
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN109614626A Automatic keyword extraction method based on gravitational model
CN113032557A (en) Microblog hot topic discovery method based on frequent word set and BERT semantics
CN106503256B Hot information mining method based on social network documents
CN110287314A Long text credibility evaluation method and system based on unsupervised clustering
CN110188189A Method for extracting document summaries based on a knowledge-based adaptive event-indexing cognitive model
Menezes et al. Building a massive corpus for named entity recognition using free open data sources
Hofmann et al. Predicting the growth of morphological families from social and linguistic factors
Ertam et al. Abstractive text summarization using deep learning with a new Turkish summarization benchmark dataset
Campbell et al. Content + context networks for user classification in Twitter
KR20170048736A Event information extraction method for extracting event information from text relay data, and user apparatus for performing the method
Patel et al. Influence of Gujarati STEmmeR in supervised learning of web page categorization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant