CN110532551A - Method, device, and storage medium for automatic text keyword extraction - Google Patents

Method, device, and storage medium for automatic text keyword extraction

Info

Publication number
CN110532551A
Authority
CN
China
Prior art keywords
word
keyword
text
words
automatically extracts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910754155.7A
Other languages
Chinese (zh)
Inventor
龚朝辉
陶予祺
童刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Long Mobile Network Technology Co Ltd
Original Assignee
Suzhou Long Mobile Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Long Mobile Network Technology Co Ltd
Priority to CN201910754155.7A
Priority to PCT/CN2019/115115 (WO2021027085A1)
Publication of CN110532551A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a method, device, and storage medium for automatic text keyword extraction. The method comprises: obtaining an n-gram candidate keyword set; and merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1. Compared with the prior art, the technical solution merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.

Description

Method, device, and storage medium for automatic text keyword extraction
Technical field
The present invention relates to the field of Internet technology, and in particular to a method, device, and storage medium for automatic text keyword extraction.
Background
Automatic keyword extraction is the automatic extraction of topical or important words or phrases from a text. It is a basic and necessary step for many text mining tasks such as text retrieval and text summarization. The keywords of a document characterize its subject and key content, and are the smallest unit for understanding the document's content.
There are many methods of automatic keyword extraction. More than ten mainstream methods exist, including methods based on linguistic analysis, statistical methods, and machine learning methods. Statistical methods extract a document's keywords using statistical information about the words in the document. They are comparatively simple, need no training data, and generally need no external knowledge base, so extraction is fast and they are commonly used in scenarios that require real-time computation.
The first step of keyword extraction for Chinese natural language is to segment the text into words and build a vocabulary, and then to extract keywords from that vocabulary. With this approach, a keyword can only be a word in the vocabulary. General-purpose segmentation tools use a fine segmentation granularity (fine segmentation introduces relatively little noise, and that noise is easy to filter), but segmentation often splits semantic units apart. For example, "China Internet conference" is segmented into "China", "Internet", and "conference". If "China Internet conference" is in fact a keyword, it is not in the vocabulary, so it is discarded and never extracted as a keyword. Conversely, if the segmentation tool uses a coarse granularity (for example trigrams or longer), it introduces more noise that is hard to filter, so many noise keywords are extracted.
Summary of the invention
The purpose of the present invention is to provide a method, device, and storage medium for automatic text keyword extraction.
To achieve one of the above objects, an embodiment of the present invention provides a method for automatic text keyword extraction, the method comprising:
Obtaining an n-gram candidate keyword set;
Merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.
As a further improvement of an embodiment of the present invention, the method further comprises:
Removing from the n-gram candidate keyword set the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain an n-gram result keyword set.
As a further improvement of an embodiment of the present invention, the method further comprises:
When the word length of a keyword in the (n+1)-gram result keyword set is greater than the maximum word length, removing the first or the last unigram of the keyword to obtain an optimized keyword;
Matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.
As a further improvement of an embodiment of the present invention, the step of "obtaining an n-gram candidate keyword set" comprises:
Segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set.
As a further improvement of an embodiment of the present invention, when n is 2, that is, after the text is segmented into a bigram set, the step of filtering noise comprises:
After segmenting the text into a unigram set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords;
Filtering out words in the bigram set whose word frequency is less than or equal to max_count/5;
Filtering out words in the bigram set whose word frequency is less than 2;
A word in the bigram set comprises a preceding word and a following word; with x being the smaller of the word frequencies of the preceding word and the following word in the unigram set, filtering out words in the bigram set whose word frequency is less than 2x/3;
Filtering out words in the bigram set that do not conform to the rules.
As a further improvement of an embodiment of the present invention, the step of "obtaining an n-gram candidate keyword set" comprises:
Obtaining an (n-1)-gram candidate keyword set;
Merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.
As a further improvement of an embodiment of the present invention, after the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.
As a further improvement of an embodiment of the present invention, the step of performing noise filtering after merging the keywords in the n-gram candidate keyword set comprises:
After segmenting the text into a unigram set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords;
Filtering out merged n-gram candidate keywords whose word frequency is less than max_count/4;
Filtering out merged n-gram candidate keywords whose word frequency is less than 2.
To achieve one of the above objects, an embodiment of the present invention provides an electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps of the above method for automatic text keyword extraction.
To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above method for automatic text keyword extraction.
Compared with the prior art, the technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for automatic text keyword extraction according to an embodiment of the present invention.
Fig. 2 is a flow diagram of the method for automatic text keyword extraction according to a specific embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the present invention; structural, methodological, or functional changes made by those skilled in the art on the basis of these embodiments all fall within the scope of protection of the present invention.
It should be noted that a keyword in the present invention can be a word, the smallest linguistic unit that can be used on its own, such as "flower", "bird", "people", "youth", or "language". A keyword can also be a phrase, a syntactic unit formed by combining two or more words, such as "prioritization scheme", "intellectual property", "China Internet conference", or "national institute of patent agents of China". Accordingly, an n-gram (n being a positive integer) in the present invention is a word or phrase made up of n words. A unigram is a single word, a bigram is a phrase of two words combined, that is, a bigram contains two unigrams, and so on.
As shown in Fig. 1, the method for automatic text keyword extraction of the present invention comprises:
Obtaining an n-gram candidate keyword set;
Merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.
In the present invention, the n-gram candidate keyword set can be obtained in several ways, which are described in detail later. After the n-gram candidate keyword set is obtained, the keywords in the set are compared pairwise to find two keywords that contain the same (n-1)-gram at different positions within the keyword. Each such pair is merged into an (n+1)-gram result keyword, giving the (n+1)-gram result keyword set.
The method is explained below with a concrete example in which n equals 2. Suppose the bigram candidate keyword set obtained is: {key interest, smart phone, interest conflict, semiconductor field, Android operating, operating system, operating ecosystem, attract talent, talent shortage}. "Key interest" and "interest conflict" both contain the unigram "interest", and "interest" occupies a different position in the two keywords (last in the first, first in the second), so the two keywords can be merged in order. The merged trigram result keyword is "key interest conflict". Proceeding in the same way gives the trigram result keyword set: {key interest conflict, Android operating system, Android operating ecosystem, attract talent shortage}.
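The merge step in the example above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each keyword is represented as a tuple of unigrams, and it reads "the same (n-1)-gram at different positions" as "the suffix of one keyword equals the prefix of the other". The function name merge_ngrams is hypothetical.

```python
from itertools import permutations

def merge_ngrams(candidates):
    """Merge pairs of n-grams that overlap on an (n-1)-gram into (n+1)-grams.

    Assumption: each keyword is a tuple of unigrams, all of the same length n.
    """
    merged = []
    for left, right in permutations(candidates, 2):
        # The shared (n-1)-gram is the suffix of `left` and the prefix of `right`.
        if left[1:] == right[:-1]:
            combined = left + right[-1:]
            if combined not in merged:  # avoid duplicate results
                merged.append(combined)
    return merged

# Bigram candidate set from the example (English glosses, one token per unigram).
candidates = [
    ("key", "interest"), ("smart", "phone"), ("interest", "conflict"),
    ("semiconductor", "field"), ("Android", "operating"),
    ("operating", "system"), ("operating", "ecosystem"),
    ("attract", "talent"), ("talent", "shortage"),
]
trigrams = merge_ngrams(candidates)
```

The pairwise comparison is quadratic in the number of candidates, which matches the "compared pairwise" wording; for large candidate sets an index keyed by the leading (n-1)-gram would be a natural optimization.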
The technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.
Further, after the keywords in the n-gram candidate keyword set are merged, noise filtering is performed first to obtain the (n+1)-gram result keyword set.
Noise filtering means removing keywords that do not conform to grammatical norms or do not meet the requirements.
In a preferred embodiment, the step of performing noise filtering after merging the keywords in the n-gram candidate keyword set comprises:
After segmenting the text from which keywords are to be extracted into a unigram set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords;
Filtering out merged n-gram candidate keywords whose word frequency is less than max_count/4;
Filtering out merged n-gram candidate keywords whose word frequency is less than 2.
In this embodiment, a word segmentation tool such as jieba, hanlp, stanfordNLP, or thulac can be used to segment the text from which keywords are to be extracted into a unigram set, and noise filtering is then applied to the words in this unigram set. Specifically, the words in the set can be filtered by part of speech, word frequency, and word length. Part-of-speech filtering can remove adjectives, adverbs, prepositions, and the like, keeping only nouns and verbs. Word-frequency filtering removes words whose frequency in the text is greater than a maximum word frequency or less than a minimum word frequency. Word-length filtering removes words whose length is greater than a maximum length or less than a minimum length. For word-frequency and word-length filtering, the maximum word frequency, minimum word frequency, maximum length, and minimum length are derived from empirical data over tens of thousands or even tens of millions of collected samples, and filtering is then performed with these values.
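The three unigram filters described above might be combined roughly as in the sketch below. It assumes the segmenter has already produced (word, part-of-speech tag) pairs; the tag prefixes "n" and "v" for nouns and verbs, the function name, and all threshold defaults are illustrative placeholders rather than values given in the text.

```python
from collections import Counter

def filter_unigrams(tagged_tokens, keep_tags=("n", "v"),
                    min_freq=1, max_freq=1000, min_len=2, max_len=8):
    """Return {word: frequency} for unigrams that survive all three filters.

    tagged_tokens: list of (word, pos_tag) pairs from a segmenter (assumed input).
    The thresholds are placeholders; the text derives them from empirical data.
    """
    counts = Counter(word for word, _ in tagged_tokens)
    kept = {}
    for word, tag in tagged_tokens:
        if not tag.startswith(keep_tags):                # part-of-speech filter
            continue
        if not (min_freq <= counts[word] <= max_freq):   # word-frequency filter
            continue
        if not (min_len <= len(word) <= max_len):        # word-length filter
            continue
        kept[word] = counts[word]
    return kept
```

Returning the frequencies alongside the surviving words is a design convenience, since the later steps need the highest unigram frequency max_count.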
After the noise in the unigram set has been filtered out, a keyword extraction algorithm, such as TF-IDF or TextRank, is used to extract the keywords in the unigram set. The present invention preferably uses the TF-IDF algorithm. TF-IDF computes a word's weight from a global statistic of the word over the corpus, IDF (inverse document frequency), and the word's TF (term frequency) in the current document, and takes the highest-weighted words as keywords. TF is the number of times the given word occurs in the text. IDF is the logarithm of the ratio of the total number of documents in the corpus to the number of documents that contain the given word. TF-IDF is the product of TF and IDF. The TF-IDF value of each word in the document is computed and used as the word's weight for keyword selection. Keywords can be selected by TF-IDF in at least two ways. Mode one, absolute value: every word in the set whose weight exceeds a fixed value is extracted as a keyword. Mode two, relative value: the several top-ranked words by weight in the set are extracted as keywords.
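The TF-IDF weighting and the two selection modes can be illustrated with a short sketch. It follows the definitions in the paragraph above (TF as a raw count, IDF as the logarithm of total documents over documents containing the word). The corpus representation as a list of word sets, the guard against a zero document count, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Weight each word of a document by TF * IDF.

    doc_tokens: list of words in the current document.
    corpus: list of sets, one set of words per document (assumed representation).
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, freq in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        # Guard (an assumption): treat a word absent from the corpus as df = 1.
        weights[word] = freq * math.log(n_docs / max(df, 1))
    return weights

def select_absolute(weights, threshold):
    """Mode one: keep every word whose weight exceeds a fixed value."""
    return [w for w, score in weights.items() if score > threshold]

def select_top_k(weights, k):
    """Mode two: keep the k highest-weighted words."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:k]]
```

A word that appears in every document of the corpus gets an IDF of zero under this definition, so it can never be selected by either mode, however often it occurs in the current document.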
After the keywords in the unigram set are extracted, the unigram candidate keyword set is obtained. The highest word frequency max_count among the keywords in this set is found, and merged n-gram candidate keywords whose word frequency is less than max_count/4 are filtered out. Note that if more (n+1)-gram keywords are wanted, the filtering condition can be relaxed, for example by filtering out merged keywords whose word frequency is less than max_count/5. If fewer (n+1)-gram keywords are wanted, the condition can be tightened, for example by filtering out merged keywords whose word frequency is less than max_count/3, and so on. In addition, words that are accidental combinations and occur only once must also be filtered out, that is, merged n-gram candidate keywords whose word frequency is less than 2.
Noise filtering yields an (n+1)-gram result keyword set with little noise and relatively high accuracy.
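As a rough illustration of this post-merge filter, the sketch below counts how often each merged keyword occurs in the segmented text and applies the two thresholds. The text does not say how the frequency of a merged keyword is measured; counting contiguous occurrences in the token sequence is an assumption, as are the function names. The divisor parameter corresponds to the adjustable 4 in max_count/4.

```python
def count_occurrences(tokens, ngram):
    """Count contiguous occurrences of `ngram` (a tuple) in a token list."""
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == ngram)

def filter_merged(merged, tokens, max_count, divisor=4, min_freq=2):
    """Drop merged keywords that are rarer than max_count/divisor or than min_freq."""
    threshold = max_count / divisor
    kept = []
    for keyword in merged:
        freq = count_occurrences(tokens, keyword)
        if freq < threshold or freq < min_freq:
            continue
        kept.append(keyword)
    return kept
```

Raising divisor (for example to 5) keeps more merged keywords, and lowering it (for example to 3) keeps fewer, mirroring the relax and tighten options described above.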
Further, the method also comprises: removing from the n-gram candidate keyword set the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain the n-gram result keyword set.
Continuing the example above, the bigram candidate keyword set is: {key interest, smart phone, interest conflict, semiconductor field, Android operating, operating system, operating ecosystem, attract talent, talent shortage}. The bigram candidate keywords contained in the trigram result keywords are removed. For example, "key interest conflict" contains "key interest" and "interest conflict", so "key interest" and "interest conflict" are removed from the bigram candidate keyword set. Proceeding in the same way gives the bigram result keyword set {smart phone, semiconductor field}.
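The removal step in this example can be sketched as a containment check. Keywords are again assumed to be tuples of unigrams, and the function name is hypothetical.

```python
def remove_contained(candidates, merged):
    """Keep only the n-gram candidates that appear in none of the merged (n+1)-grams."""
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous run inside `longer`.
        n = len(shorter)
        return any(longer[i:i + n] == shorter
                   for i in range(len(longer) - n + 1))

    return [c for c in candidates
            if not any(contains(m, c) for m in merged)]
```

Applied to the nine bigram candidates and four trigram results of the example, only the two bigrams that took part in no merge remain.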
After keywords are merged, the error rate rises as their length grows. In a preferred embodiment, the method further comprises:
When the word length of a keyword in the (n+1)-gram result keyword set is greater than the maximum word length, removing the first or the last unigram of the keyword to obtain an optimized keyword;
Matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.
The restricted vocabulary can be customized according to actual needs; for example, it may consist of the proper nouns of an input method and the full names and abbreviations of companies. The maximum word length can be determined empirically.
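A minimal sketch of this length optimization follows. Several details are assumptions, since the text leaves them open: word length is measured in characters of the joined keyword, the first-unigram trim is tried before the last-unigram trim, and a keyword with no vocabulary match is left unchanged. The function name and the sample vocabulary are illustrative.

```python
def optimize_keyword(keyword, max_len, vocabulary):
    """Trim an over-long keyword by one unigram if the result is in the vocabulary.

    keyword: tuple of unigrams; vocabulary: set of allowed strings (assumed form).
    """
    if sum(len(word) for word in keyword) <= max_len:
        return keyword  # within the length limit, nothing to do
    # Try dropping the first unigram, then the last one.
    for trimmed in (keyword[1:], keyword[:-1]):
        if "".join(trimmed) in vocabulary:
            return trimmed
    return keyword  # assumption: keep the original when nothing matches

# Illustrative data: a 9-character keyword against a 7-character limit.
vocabulary = {"中国互联网大会"}
long_keyword = ("中国", "互联网", "大会", "召开")
optimized = optimize_keyword(long_keyword, max_len=7, vocabulary=vocabulary)
```

Joining with an empty string suits Chinese text; for space-delimited languages the join separator and the length measure would need to change.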
Further, the step of "obtaining an n-gram candidate keyword set" comprises:
Segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set. This step is similar to the step of obtaining the unigram candidate keyword set, except that the noise is filtered differently: as n increases there is more noise, and the filtering becomes more complex.
Taking n equal to 2 as an example, after the text is segmented into a bigram set, the step of filtering noise comprises:
Obtaining the highest word frequency max_count in the unigram candidate keyword set, and filtering out words in the bigram set whose word frequency is less than or equal to max_count/5. The value max_count/5 is adjustable; max_count/6 or max_count/4 can also be used.
Filtering out words that are accidental combinations and occur only once, that is, words in the bigram set whose word frequency is less than 2.
In addition, a word in the bigram set contains two unigrams, a preceding word and a following word. Let x be the smaller of the word frequencies of the preceding word and the following word in the unigram candidate keyword set (for example, if the word frequency of the preceding word is 3 and that of the following word is 4, then x = 3). Words in the bigram set whose word frequency is less than 2x/3 are filtered out.
Words in the bigram set that do not conform to the rules must also be filtered out. Such words include words with obvious grammatical errors (for example, the final word is a preposition), words containing a preposition (such as "in tomorrow"), and words containing a unit word (such as "80 yuan"). Compared with the filtering of the unigram set, the filtering of the bigram set is more complex.
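The four bigram filters listed above could be combined as in the following sketch. The rule check is passed in as a caller-supplied predicate because the text only gives examples of rule violations; that predicate, the function name, and the dictionary-based inputs are assumptions for illustration.

```python
def filter_bigrams(bigram_counts, unigram_counts, max_count,
                   divisor=5, min_freq=2, is_valid=None):
    """Apply the frequency and rule filters to a {(front, back): freq} mapping.

    unigram_counts: {word: freq} for the unigram candidate keywords.
    is_valid: optional predicate on a bigram tuple (hypothetical rule check).
    """
    kept = {}
    for (front, back), freq in bigram_counts.items():
        if freq <= max_count / divisor:     # too rare relative to max_count
            continue
        if freq < min_freq:                 # accidental one-off combination
            continue
        # x is the smaller unigram frequency of the two component words.
        x = min(unigram_counts.get(front, 0), unigram_counts.get(back, 0))
        if freq < 2 * x / 3:                # rare relative to its components
            continue
        if is_valid is not None and not is_valid((front, back)):
            continue                        # violates a grammatical rule
        kept[(front, back)] = freq
    return kept
```

The 2x/3 test removes bigrams whose component words mostly occur apart from each other, which is a different signal from the global max_count threshold.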
It should be noted that when n is a positive integer greater than 2, the n-gram candidate keyword set can also be obtained by merging (n-1)-gram candidate keywords:
Obtaining an (n-1)-gram candidate keyword set;
Merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.
The principle of this embodiment has already been explained above and is not repeated here.
The process of extracting text keywords with the above method is further explained below through a specific embodiment:
As shown in Fig. 2, the jieba segmentation tool is used to segment the text from which keywords are to be extracted (hereinafter "the text") into a unigram set and a bigram set respectively. The noise in the unigram set is filtered, and TF-IDF is used to extract the keywords of the unigram set, giving the unigram candidate keyword set; the highest word frequency max_count in that set is obtained. Using max_count and the approach described above, the noise in the bigram set is filtered, and TF-IDF is used to extract the keywords of the bigram set, giving the bigram candidate keyword set.
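The text does not spell out how the bigram set is built from the segmenter output. One plausible reading, shown here purely as an assumption, is to pair each unigram with its successor and count the pairs; the function name is hypothetical, and the token list stands in for the output of a segmenter such as jieba.

```python
from collections import Counter

def count_unigrams_and_bigrams(tokens):
    """Return frequency tables for unigrams and adjacent-pair bigrams.

    tokens: the segmented text as an ordered list of words (assumed input).
    """
    unigram_counts = Counter(tokens)
    # zip pairs each token with the one that follows it.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts
```

Both tables can then feed the noise filters and the max_count lookup described in this embodiment.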
From the unigram candidate keyword set, the unigram candidate keywords contained in the bigram candidate keywords are removed, giving the unigram result keyword set.
The keywords in the bigram candidate keyword set are merged and noise is filtered (as described above), giving the trigram result keyword set. At the same time, the bigram candidate keywords contained in the trigram result keywords are removed from the bigram candidate keyword set, giving the bigram result keyword set.
The result of extracting keywords from the text is: the unigram result keyword set, the bigram result keyword set, and the trigram result keyword set.
Of course, a length limit can be applied to the final result: when the word length of a keyword in a result keyword set is greater than the maximum word length, the keyword is optimized in the manner described above.
The present invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps of the above method for automatic text keyword extraction.
The present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above method for automatic text keyword extraction.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity. Those skilled in the art should treat the specification as a whole; the technical solutions in the embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its scope of protection. Any equivalent embodiment or change that does not depart from the technical spirit of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for automatic text keyword extraction, characterized in that the method comprises:
Obtaining an n-gram candidate keyword set;
Merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.
2. The method for automatic text keyword extraction according to claim 1, characterized in that the method further comprises:
Removing from the n-gram candidate keyword set the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain an n-gram result keyword set.
3. The method for automatic text keyword extraction according to claim 1, characterized in that the method further comprises:
When the word length of a keyword in the (n+1)-gram result keyword set is greater than the maximum word length, removing the first or the last unigram of the keyword to obtain an optimized keyword;
Matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.
4. The method for automatic text keyword extraction according to claim 1, characterized in that the step of "obtaining an n-gram candidate keyword set" comprises:
Segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set.
5. The method for automatic text keyword extraction according to claim 4, characterized in that when n is 2, that is, after the text is segmented into a bigram set, the step of filtering noise comprises:
After segmenting the text into a unigram set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords;
Filtering out words in the bigram set whose word frequency is less than or equal to max_count/5;
Filtering out words in the bigram set whose word frequency is less than 2;
A word in the bigram set comprises a preceding word and a following word; with x being the smaller of the word frequencies of the preceding word and the following word in the unigram set, filtering out words in the bigram set whose word frequency is less than 2x/3;
Filtering out words in the bigram set that do not conform to the rules.
6. The method for automatic text keyword extraction according to claim 1, characterized in that the step of "obtaining an n-gram candidate keyword set" comprises:
Obtaining an (n-1)-gram candidate keyword set;
Merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.
7. The method for automatic text keyword extraction according to claim 1, characterized in that:
After the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.
8. The method for automatic text keyword extraction according to claim 7, characterized in that the step of performing noise filtering after merging the keywords in the n-gram candidate keyword set comprises:
After segmenting the text into a unigram set, first filtering the noise in the set, then extracting the keywords in the set, and obtaining the highest word frequency max_count among the keywords;
Filtering out merged n-gram candidate keywords whose word frequency is less than max_count/4;
Filtering out merged n-gram candidate keywords whose word frequency is less than 2.
9. An electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when executing the program, implements the steps of the method for automatic text keyword extraction according to any one of claims 1-8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for automatic text keyword extraction according to any one of claims 1-8.
CN201910754155.7A 2019-08-15 2019-08-15 Method, device, and storage medium for automatic text keyword extraction Pending CN110532551A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910754155.7A CN110532551A (en) 2019-08-15 2019-08-15 Method, device, and storage medium for automatic text keyword extraction
PCT/CN2019/115115 WO2021027085A1 (en) 2019-08-15 2019-11-01 Method and device for automatically extracting text keyword, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910754155.7A CN110532551A (en) 2019-08-15 2019-08-15 Method, device, and storage medium for automatic text keyword extraction

Publications (1)

Publication Number Publication Date
CN110532551A true CN110532551A (en) 2019-12-03

Family

ID=68663358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910754155.7A Pending CN110532551A (en) 2019-08-15 2019-08-15 Method, device, and storage medium for automatic text keyword extraction

Country Status (2)

Country Link
CN (1) CN110532551A (en)
WO (1) WO2021027085A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488727A (en) * 2020-03-24 2020-08-04 南阳柯丽尔科技有限公司 Word file parsing method, word file parsing apparatus, and computer-readable storage medium
CN116978384A (en) * 2023-09-25 2023-10-31 成都市青羊大数据有限责任公司 Public security integrated big data management system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1354432A (en) * 2000-11-17 2002-06-19 意蓝科技股份有限公司 Method for automatically-searching key word from file and its system
CN102375863A (en) * 2010-08-27 2012-03-14 北京四维图新科技股份有限公司 Method and device for keyword extraction in geographic information field
CN102411563A (en) * 2010-09-26 2012-04-11 阿里巴巴集团控股有限公司 Method, device and system for identifying target words
CN103092979A (en) * 2013-01-31 2013-05-08 中国科学院对地观测与数字地球科学中心 Processing method and device for searching of natural language by remote sensing data
CN103678318A (en) * 2012-08-31 2014-03-26 富士通株式会社 Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN104216875A (en) * 2014-09-26 2014-12-17 中国科学院自动化研究所 Automatic microblog text abstracting method based on unsupervised key bigram extraction
CN105426539A (en) * 2015-12-23 2016-03-23 成都电科心通捷信科技有限公司 Dictionary-based lucene Chinese word segmentation method
CN107665191A (en) * 2017-10-19 2018-02-06 中国人民解放军陆军工程大学 A kind of proprietary protocol message format estimating method based on expanded prefix tree

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003256456A1 (en) * 2002-07-03 2004-01-23 Word Data Corp. Text-representation, text-matching and text-classification code, system and method
CN100520782C (en) * 2007-11-09 2009-07-29 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN106557459B (en) * 2015-09-24 2019-12-27 北京神州泰岳软件股份有限公司 Method and device for extracting new words from work order
CN105956158B (en) * 2016-05-17 2019-08-09 清华大学 The method that network neologisms based on massive micro-blog text and user information automatically extract

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488727A (en) * 2020-03-24 2020-08-04 南阳柯丽尔科技有限公司 Word file parsing method, word file parsing apparatus, and computer-readable storage medium
CN111488727B (en) * 2020-03-24 2023-09-19 南阳柯丽尔科技有限公司 Word file parsing method, word file parsing apparatus, and computer-readable storage medium
CN116978384A (en) * 2023-09-25 2023-10-31 成都市青羊大数据有限责任公司 Public security integrated big data management system
CN116978384B (en) * 2023-09-25 2024-01-02 成都市青羊大数据有限责任公司 Public security integrated big data management system

Also Published As

Publication number Publication date
WO2021027085A1 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
Verma et al. Tokenization and filtering process in RapidMiner
CN102169495B (en) Industry dictionary generating method and device
CN106021272B (en) Keyword extraction method based on distributed-representation word vectors
CN106294320B (en) Terminology extraction method and system for academic papers
CN104598532A (en) Information processing method and device
US20130339373A1 (en) Method and system of filtering and recommending documents
Al-Taani et al. An extractive graph-based Arabic text summarization approach
Mizumoto et al. Sentiment analysis of stock market news with semi-supervised learning
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN108170666A (en) Improved method based on TF-IDF keyword extraction
Pabitha et al. Automatic question generation system
Prasad WING-NUS at CL-SciSumm 2017: Learning from Syntactic and Semantic Similarity for Citation Contextualization.
CN110532551A (en) Method, device and storage medium for automatic extraction of text keywords
Desai et al. Automatic text summarization using supervised machine learning technique for Hindi language
Sheeba et al. Improved keyword and keyphrase extraction from meeting transcripts
KR20210062934A (en) Text document cluster and topic generation apparatus and method thereof
Skorkovská Application of lemmatization and summarization methods in topic identification module for large scale language modeling data filtering
Agathangelou et al. Mining domain-specific dictionaries of opinion words
KR101120038B1 (en) Neologism selection apparatus and its method
KR101290439B1 (en) Method for summarizing meeting minutes based on sentence network
CN104166712A (en) Method and system for scientific and technical literature retrieval
Juric et al. Discovering links between political debates and media
Al-Thwaib Text Summarization as Feature Selection for Arabic Text Classification.
Rostami et al. Proposing a method to classify texts using data mining
CN115455975A (en) Method and device for extracting topic keywords based on multi-model fusion decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191203
