CN110532551A - Method, device, and storage medium for automatic text keyword extraction
- Publication number
- CN110532551A (application CN201910754155.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- keyword
- text
- words
- automatically extracts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention discloses a method, a device, and a storage medium for automatic text keyword extraction. The method comprises: obtaining an n-gram candidate keyword set; and merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1. Compared with the prior art, the technical solution merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method, a device, and a storage medium for automatic text keyword extraction.
Background
Automatic keyword extraction means automatically extracting topical or important words or phrases from a text. It is basic and necessary groundwork for many text mining tasks such as text retrieval and text summarization. The keywords of a document characterize its topic and key content, and are the smallest unit for understanding the document's content.
There are many methods for automatic keyword extraction. The mainstream ones, such as methods based on linguistic analysis, statistical methods, and machine learning methods, number more than ten. Statistical methods extract a document's keywords from statistical information about the words in the document. They are comparatively simple, need no training data, and generally need no external knowledge base, so extraction is fast and they are commonly used in scenarios that require real-time computation.
The first step of keyword extraction for Chinese natural language is to segment the text into words and build a vocabulary, and then to extract keywords from the vocabulary. As a result, a keyword can only be a word in the vocabulary. General-purpose segmentation tools segment at a fine granularity (which introduces relatively little noise, and that noise is easy to filter), but segmentation often splits the semantics apart. For example, "China Internet Conference" is split into "China", "Internet", and "Conference". If "China Internet Conference" is the actual keyword, it is discarded because it is not in the vocabulary, and it is never extracted as a keyword. Conversely, if the segmentation tool uses a coarse granularity (for example trigrams or longer), it introduces more noise that is hard to filter, so many noise keywords are extracted.
Summary of the invention
The purpose of the present invention is to provide a method, a device, and a storage medium for automatic text keyword extraction.
To achieve one of the above purposes, an embodiment of the present invention provides a method for automatic text keyword extraction, the method comprising:
Obtaining an n-gram candidate keyword set;
Merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.
As a further improvement of an embodiment of the present invention, the method further comprises:
Removing, from the n-gram candidate keyword set, the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain an n-gram result keyword set.
As a further improvement of an embodiment of the present invention, the method further comprises:
When the length of a keyword in the (n+1)-gram result keyword set exceeds the maximum word length, removing the first or the last unigram of the keyword to obtain an optimized keyword;
Matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.
As a further improvement of an embodiment of the present invention, the step of "obtaining an n-gram candidate keyword set" comprises:
Segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set.
As a further improvement of an embodiment of the present invention, when n is 2, that is, after the text is segmented into a bigram set, the step of filtering noise comprises:
After segmenting the text into a unigram set, first filtering the noise in that set, then extracting the keywords in that set and obtaining the highest word frequency max_count among those keywords;
Filtering out words in the bigram set whose word frequency is less than or equal to max_count/5;
Filtering out words in the bigram set whose word frequency is less than 2;
Each word in the bigram set consists of a front word and a rear word; with x being the smaller of the word frequencies of the front word and the rear word in the unigram set, filtering out words in the bigram set whose word frequency is less than 2x/3;
Filtering out words in the bigram set that do not conform to the rules.
As a further improvement of an embodiment of the present invention, the step of "obtaining an n-gram candidate keyword set" comprises:
Obtaining an (n-1)-gram candidate keyword set;
Merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.
As a further improvement of an embodiment of the present invention, after the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.
As a further improvement of an embodiment of the present invention, the step of filtering noise after merging the keywords in the n-gram candidate keyword set comprises:
After segmenting the text into a unigram set, first filtering the noise in that set, then extracting the keywords in that set and obtaining the highest word frequency max_count among those keywords;
Filtering out merged n-gram candidate keywords whose word frequency is less than max_count/4;
Filtering out merged n-gram candidate keywords whose word frequency is less than 2.
To achieve one of the above purposes, an embodiment of the present invention provides an electronic device comprising a memory and a processor, the memory storing a computer program executable on the processor, and the processor implementing the steps of the above method for automatic text keyword extraction when executing the program.
To achieve one of the above purposes, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method for automatic text keyword extraction when executed by a processor.
Compared with the prior art, the technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for automatic text keyword extraction according to an embodiment of the present invention.
Fig. 2 is a flow diagram of the method for automatic text keyword extraction according to a specific embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the present invention; structural, methodological, or functional changes made by those of ordinary skill in the art according to these embodiments all fall within the protection scope of the present invention.
It should be noted that a keyword in the present invention may be a word, a word being the smallest linguistic unit that can be used independently, such as flower, bird, person, youth, or language. A keyword may also be a phrase, a phrase being a grammatical unit formed by combining two or more words, such as optimization scheme, intellectual property, China Internet Conference, the national patent agents association of China, and so on. Accordingly, an n-gram (n being a positive integer) in the present invention refers to a word or phrase containing n words. For example, a unigram is a single word, and a bigram is a phrase of two words combined, i.e. a bigram contains two unigrams, and so on.
As shown in Fig. 1, the method for automatic text keyword extraction of the present invention comprises:
Obtaining an n-gram candidate keyword set;
Merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.
In the present invention, the n-gram candidate keyword set can be obtained in several ways, which are described in detail later. After the n-gram candidate keyword set is obtained, the keywords in the set are compared pairwise; any two keywords that contain the same (n-1)-gram at different positions within the keyword are merged into an (n+1)-gram result keyword, yielding the (n+1)-gram result keyword set.
The method is explained below with a concrete example. Taking n equal to 2, suppose the bigram candidate keyword set obtained is: {core interest, smart phone, interest conflict, semiconductor field, Android operating, operating system, operating ecosystem, attract talent, talent shortage}. "Core interest" and "interest conflict" both contain the unigram "interest", and "interest" occupies a different position in the two keywords, so the two can be merged in order, giving the trigram result keyword "core interest conflict". Proceeding in the same way gives the trigram result keyword set: {core interest conflict, Android operating system, Android operating ecosystem, attract talent shortage}.
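The merging step in this example can be sketched as follows. This is a minimal illustration, not the patented implementation: keywords are represented as tuples of words, the English word tuples are illustrative glosses of the example's bigrams, and the function name `merge_ngrams` is hypothetical. For an n-gram, "sharing the same (n-1)-gram at different positions" is taken to mean that the last n-1 words of one keyword equal the first n-1 words of the other.

```python
from itertools import permutations


def merge_ngrams(candidates):
    """Merge pairs of n-grams that overlap on an (n-1)-gram into (n+1)-grams.

    Each keyword is a tuple of words. Keywords a and b are merged when the
    last n-1 words of a equal the first n-1 words of b, i.e. both contain
    the same (n-1)-gram but at different positions.
    """
    merged = []
    for a, b in permutations(candidates, 2):  # ordered pairs, a != b by index
        if a[1:] == b[:-1]:
            result = a + b[-1:]  # append the one word that b adds
            if result not in merged:
                merged.append(result)
    return merged


# Illustrative glosses of the bigram candidate set from the example.
bigrams = [
    ("core", "interest"), ("smart", "phone"), ("interest", "conflict"),
    ("semiconductor", "field"), ("Android", "operating"),
    ("operating", "system"), ("operating", "ecosystem"),
    ("attract", "talent"), ("talent", "shortage"),
]
trigrams = merge_ngrams(bigrams)
for keyword in trigrams:
    print(" ".join(keyword))
```

Note that "operating system" and "operating ecosystem" are not merged with each other, because the shared unigram "operating" sits at the same position in both.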
The technical solution of the present invention merges the keywords extracted after fine-grained segmentation, so that the semantics of keywords that were split apart are restored, avoiding the semantic incompleteness caused by overly fine word segmentation.
Further, after the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.
Noise filtering means removing keywords that do not conform to grammatical norms or are otherwise unsuitable.
In a preferred embodiment, the step of filtering noise after merging the keywords in the n-gram candidate keyword set comprises:
After segmenting the text from which keywords are to be extracted into a unigram set, first filtering the noise in that set, then extracting the keywords in that set and obtaining the highest word frequency max_count among those keywords;
Filtering out merged n-gram candidate keywords whose word frequency is less than max_count/4;
Filtering out merged n-gram candidate keywords whose word frequency is less than 2.
In this embodiment, a word segmentation tool such as jieba, HanLP, StanfordNLP, or THULAC can be used to segment the text from which keywords are to be extracted into a unigram set. Noise filtering is then applied to the words in this unigram set; specifically, the words can be filtered by part of speech, word frequency, and word length. Part-of-speech filtering can remove adjectives, adverbs, prepositions, and the like, keeping only nouns and verbs. Word-frequency filtering removes words whose frequency in the text is greater than the maximum word frequency or less than the minimum word frequency. Word-length filtering removes words whose length is greater than the maximum length or less than the minimum length. For word-frequency and word-length filtering, the maximum word frequency, minimum word frequency, maximum length, minimum length, and so on are derived empirically from collected samples numbering from tens of thousands up to tens of millions, and filtering is then applied.
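A minimal sketch of the three unigram filters is given below. It assumes the text has already been segmented and tagged, so the input is a list of (word, part-of-speech) pairs; the tag names "n" and "v", the function name `filter_unigrams`, and all threshold defaults are hypothetical placeholders, since the description says only that the thresholds are derived empirically.

```python
from collections import Counter


def filter_unigrams(tagged_tokens, keep_pos=("n", "v"),
                    min_freq=1, max_freq=1000, min_len=2, max_len=8):
    """Return {word: frequency} for unigrams that pass all three filters.

    tagged_tokens: list of (word, pos) pairs from a segmenter and tagger.
    keep_pos:      part-of-speech tags to keep (assumed: noun and verb).
    The frequency and length bounds are illustrative, not from the patent.
    """
    freq = Counter(word for word, _ in tagged_tokens)
    kept = {}
    for word, pos in tagged_tokens:
        if pos not in keep_pos:                        # part-of-speech filter
            continue
        if not (min_freq <= freq[word] <= max_freq):   # word-frequency filter
            continue
        if not (min_len <= len(word) <= max_len):      # word-length filter
            continue
        kept[word] = freq[word]
    return kept


tagged = [("system", "n"), ("system", "n"), ("quickly", "d"),
          ("a", "n"), ("run", "v")]
print(filter_unigrams(tagged))
```

Here "quickly" is dropped by the part-of-speech filter and "a" by the length filter; a word that appears with several tags is counted once across all its occurrences, which is a simplification.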
After the noise in the unigram set has been filtered out, a keyword extraction algorithm, such as TF-IDF or TextRank, is used to extract the keywords in the unigram set. The present invention preferably uses TF-IDF. TF-IDF computes a word's weight from its global statistic in the corpus, IDF (inverse document frequency), and its frequency in the current document, TF (term frequency), and takes the words with the highest weights as keywords. TF is the number of times a given word occurs in the text. IDF is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the given word. TF-IDF is the product of TF and IDF. The TF-IDF of each word in the document is computed and used as the word's weight for keyword selection. Keywords can be selected by TF-IDF in at least two ways. Mode one, absolute threshold: every word in the set whose weight exceeds a fixed value is extracted as a keyword. Mode two, relative ranking: the several top-ranked words by weight in the set are extracted as keywords.
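The TF-IDF weighting and the two selection modes can be sketched as follows, using exactly the definitions given above: TF as a raw occurrence count and IDF as the logarithm of total documents over documents containing the word. The function names and the toy corpus are hypothetical, and skipping words that never appear in the corpus is an assumption of this sketch, since no smoothing is described.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Weight each word of doc_tokens by TF * IDF.

    doc_tokens: list of words in the current document.
    corpus:     list of documents, each a list of words.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        if df == 0:
            continue  # assumption: words unseen in the corpus are skipped
        weights[word] = count * math.log(n_docs / df)
    return weights


def select_absolute(weights, threshold):
    """Mode one: keep every word whose weight exceeds a fixed value."""
    return {word for word, score in weights.items() if score > threshold}


def select_top_k(weights, k):
    """Mode two: keep the k highest-weighted words."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


corpus = [["android", "system"], ["android", "phone"],
          ["system", "design"], ["phone", "market"]]
doc = ["android", "android", "system"]
weights = tfidf_weights(doc, corpus)
print(select_top_k(weights, 1), select_absolute(weights, 1.0))
```

In this toy corpus both words appear in two of four documents, so their IDF is equal and the ranking is decided by TF alone.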
After the keywords in the unigram set are extracted, the unigram candidate keyword set is obtained. The highest word frequency max_count among the keywords in this set is found, and merged n-gram candidate keywords whose word frequency is less than max_count/4 are filtered out. Note that to obtain more (n+1)-gram keywords the filtering condition can be relaxed, for example by filtering out merged keywords whose word frequency is less than max_count/5; to obtain fewer (n+1)-gram keywords it can be tightened, for example by filtering out merged keywords whose word frequency is less than max_count/3, and so on.
In addition, accidental combinations that occur only once are also filtered out, i.e. merged n-gram candidate keywords whose word frequency is less than 2.
Noise filtering yields an (n+1)-gram result keyword set with little noise and relatively high accuracy.
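The two post-merge filters can be sketched as below. The description does not say how the frequency of a merged keyword is counted; this sketch assumes it is the number of times the keyword's words occur contiguously in the segmented text. The function names and the `divisor` parameter (4 by default, adjustable to 3 or 5 as discussed above) are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous run of n words in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def filter_merged(merged, tokens, max_count, divisor=4, min_freq=2):
    """Keep merged keywords that are frequent enough in the text.

    merged:    list of merged keywords (tuples of words, all the same length).
    tokens:    the segmented text as a list of words.
    max_count: highest word frequency among the unigram candidate keywords.
    A keyword is dropped if its frequency is below max_count / divisor
    or below min_freq.
    """
    if not merged:
        return []
    freq = count_ngrams(tokens, len(merged[0]))
    threshold = max_count / divisor
    return [kw for kw in merged
            if freq[kw] >= threshold and freq[kw] >= min_freq]


tokens = ["Android", "operating", "system"] * 3 + ["attract", "talent", "shortage"]
merged = [("Android", "operating", "system"), ("attract", "talent", "shortage")]
print(filter_merged(merged, tokens, max_count=8))
```

With max_count equal to 8 the threshold is 2, so the keyword seen three times survives and the one seen once is treated as an accidental combination.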
Further, the method also comprises: removing, from the n-gram candidate keyword set, the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain the n-gram result keyword set.
Continuing the example above, the bigram candidate keyword set is {core interest, smart phone, interest conflict, semiconductor field, Android operating, operating system, operating ecosystem, attract talent, talent shortage}. The bigram candidate keywords contained in the trigram result keywords are removed. For example, "core interest conflict" contains "core interest" and "interest conflict", so these two are removed from the bigram candidate keyword set. Proceeding in the same way gives the bigram result keyword set {smart phone, semiconductor field}.
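The removal step can be sketched as follows, again with keywords as tuples of words and with illustrative English glosses of the example's keywords. The function name `remove_covered` is hypothetical. Every contiguous n-gram inside an (n+1)-gram result keyword is collected, and candidates found in that collection are dropped.

```python
def remove_covered(candidates, results):
    """Drop n-gram candidates that occur inside any (n+1)-gram result keyword."""
    covered = set()
    for keyword in results:
        n = len(keyword) - 1
        for i in range(len(keyword) - n + 1):  # the two n-grams inside it
            covered.add(keyword[i:i + n])
    return [c for c in candidates if c not in covered]


bigram_candidates = [
    ("core", "interest"), ("smart", "phone"), ("interest", "conflict"),
    ("semiconductor", "field"), ("Android", "operating"),
    ("operating", "system"), ("operating", "ecosystem"),
    ("attract", "talent"), ("talent", "shortage"),
]
trigram_results = [
    ("core", "interest", "conflict"),
    ("Android", "operating", "system"),
    ("Android", "operating", "ecosystem"),
    ("attract", "talent", "shortage"),
]
bigram_results = remove_covered(bigram_candidates, trigram_results)
print(bigram_results)
```

Only the two bigrams that took part in no merge remain, matching the bigram result set in the example.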
After keywords are merged, the error rate rises as their length grows. In a preferred embodiment, the method further comprises:
When the length of a keyword in the (n+1)-gram result keyword set exceeds the maximum word length, removing the first or the last unigram of the keyword to obtain an optimized keyword;
Matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.
The restricted vocabulary can be customized according to actual needs; for example, it can consist of the proper nouns of an input method and the full names and abbreviations of companies. The maximum word length can be determined empirically.
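A minimal sketch of this length optimization follows. Several details are assumptions of the sketch rather than statements of the description: keyword length is measured as the total number of characters in its words, the variant without the first unigram is tried before the variant without the last, and a keyword with no matching variant is left unchanged. The function name `optimize_keyword` and the sample vocabulary are hypothetical.

```python
def optimize_keyword(keyword, max_len, vocabulary):
    """Shorten an over-long keyword if a trimmed form is in the vocabulary.

    keyword:    tuple of words.
    max_len:    maximum allowed length, counted in characters (assumption).
    vocabulary: set of allowed keywords (tuples of words).
    """
    length = sum(len(word) for word in keyword)
    if length <= max_len:
        return keyword
    # Try dropping the first unigram, then the last one.
    for trimmed in (keyword[1:], keyword[:-1]):
        if trimmed in vocabulary:
            return trimmed
    return keyword  # assumption: keep the original when nothing matches


vocabulary = {("operating", "system")}
print(optimize_keyword(("Android", "operating", "system"), 10, vocabulary))
```

A production version would need a policy for keywords that stay too long after trimming; the description leaves that case open.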
Further, the step of "obtaining an n-gram candidate keyword set" comprises:
Segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set. This step is similar to obtaining the unigram candidate keyword set; the difference lies in how noise is filtered. As n increases there is more noise, and the filtering becomes more complex.
Taking n equal to 2 as an example, after the text is segmented into a bigram set, the step of filtering noise comprises:
Obtaining the highest word frequency max_count in the unigram candidate keyword set, and filtering out words in the bigram set whose word frequency is less than or equal to max_count/5. The divisor here is adjustable; max_count/6 or max_count/4 can also be used.
Filtering out accidental combinations that occur only once, i.e. words in the bigram set whose word frequency is less than 2.
In addition, each word in the bigram set contains two unigrams, a front word and a rear word (in other words, a front unigram and a rear unigram). Let x be the smaller of the word frequencies of the front word and the rear word in the unigram candidate keyword set (for example, if the front word's frequency is 3 and the rear word's frequency is 4, then x = 3); words in the bigram set whose word frequency is less than 2x/3 are filtered out.
Words in the bigram set that do not conform to the rules are also filtered out. Such words may be words with obvious grammatical errors (for example, the last word is a preposition), words containing a preposition (for example "in tomorrow"), or words containing a unit of measure (for example "80 yuan"). Compared with filtering the unigram set, filtering the bigram set is more complex.
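The four bigram filters can be sketched as below. The input format (frequency dictionaries), the function name `filter_bigrams`, and the pluggable `is_valid` rule check are hypothetical; treating a front or rear word that is missing from the unigram candidates as frequency 0 is also an assumption, since the description does not cover that case.

```python
def filter_bigrams(bigram_freq, unigram_freq, max_count,
                   is_valid=lambda bigram: True, divisor=5, min_freq=2):
    """Apply the four noise filters to a bigram set.

    bigram_freq:  {(front, rear): frequency in the text}.
    unigram_freq: {word: frequency} for the unigram candidate keywords.
    max_count:    highest frequency among the unigram candidate keywords.
    is_valid:     rule check, e.g. rejecting bigrams that contain a preposition.
    """
    kept = {}
    for (front, rear), freq in bigram_freq.items():
        if freq <= max_count / divisor:   # rule 1: at or below max_count/5
            continue
        if freq < min_freq:               # rule 2: accidental, seen once
            continue
        x = min(unigram_freq.get(front, 0), unigram_freq.get(rear, 0))
        if freq < 2 * x / 3:              # rule 3: rare relative to its parts
            continue
        if not is_valid((front, rear)):   # rule 4: violates the rules
            continue
        kept[(front, rear)] = freq
    return kept


unigram_freq = {"operating": 6, "system": 6, "design": 6,
                "Android": 3, "in": 9, "tomorrow": 3}
bigram_freq = {
    ("operating", "system"): 5,
    ("Android", "operating"): 1,
    ("operating", "design"): 3,
    ("in", "tomorrow"): 3,
}
no_preposition = lambda bigram: "in" not in bigram  # toy rule for illustration
print(filter_bigrams(bigram_freq, unigram_freq, 6, is_valid=no_preposition))
```

In this toy data "Android operating" fails rule 1, "operating design" fails rule 3 (3 is below 2x/3 = 4), and "in tomorrow" fails the rule check, leaving only "operating system".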
It should be noted that when n is a positive integer greater than 2, the n-gram candidate keyword set can also be obtained by merging (n-1)-gram candidate keywords:
Obtaining an (n-1)-gram candidate keyword set;
Merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.
The principle of this embodiment has been explained above and is not repeated here.
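Building higher-order candidates by repeated merging can be sketched as follows. This is an illustration under stated assumptions: keywords are tuples of words, the function names are hypothetical, and the noise filtering that the description applies after each merge is omitted to keep the sketch short.

```python
from itertools import permutations


def merge_once(ngrams):
    """Merge n-grams whose (n-1)-word suffix and prefix coincide."""
    merged = []
    for a, b in permutations(ngrams, 2):
        if a[1:] == b[:-1] and a + b[-1:] not in merged:
            merged.append(a + b[-1:])
    return merged


def build_candidates(bigrams, n):
    """Derive n-gram candidates from bigram candidates by n-2 merge rounds."""
    if n < 2:
        raise ValueError("n must be at least 2")
    current = list(bigrams)
    for _ in range(n - 2):
        current = merge_once(current)  # a real pipeline would filter noise here
    return current


chain = [("a", "b"), ("b", "c"), ("c", "d")]
print(build_candidates(chain, 3))
print(build_candidates(chain, 4))
```

Each round raises the order by one, so two rounds turn three chained bigrams into a single 4-gram.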
The process of extracting text keywords with the above method is further explained below through a specific embodiment:
As shown in Fig. 2, the jieba segmentation tool is used to segment the text from which keywords are to be extracted (hereinafter, the text) into a unigram set and a bigram set respectively. The noise in the unigram set is filtered, the keywords of the unigram set are extracted with TF-IDF to obtain the unigram candidate keyword set, and the highest word frequency max_count in that set is obtained. Using max_count and the approach described above, the noise in the bigram set is filtered, and the keywords of the bigram set are extracted with TF-IDF to obtain the bigram candidate keyword set.
From the unigram candidate keyword set, the unigram candidate keywords contained in the bigram candidate keywords are removed to obtain the unigram result keyword set.
The keywords in the bigram candidate keyword set are merged and the noise is filtered (as described above) to obtain the trigram result keyword set; at the same time, the bigram candidate keywords contained in the trigram result keywords are removed from the bigram candidate keyword set to obtain the bigram result keyword set.
The result of extracting the keywords of the text is: the unigram result keyword set, the bigram result keyword set, and the trigram result keyword set.
Of course, a length limit can be applied to the final result: when the length of a keyword in a result keyword set exceeds the maximum word length, the keyword is optimized as described above.
The present invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program executable on the processor, and the processor implementing the steps of the above method for automatic text keyword extraction when executing the program.
The present invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method for automatic text keyword extraction when executed by a processor.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its protection scope; all equivalent embodiments or changes that do not depart from the technical spirit of the present invention shall fall within its protection scope.
Claims (10)
1. A method for automatic text keyword extraction, characterized in that the method comprises:
Obtaining an n-gram candidate keyword set;
Merging any two keywords in the n-gram candidate keyword set that contain the same (n-1)-gram at different positions within the keyword, to obtain an (n+1)-gram result keyword set, where n is a positive integer greater than 1.
2. The method for automatic text keyword extraction according to claim 1, characterized in that the method further comprises:
Removing, from the n-gram candidate keyword set, the n-gram candidate keywords contained in the (n+1)-gram result keywords, to obtain an n-gram result keyword set.
3. The method for automatic text keyword extraction according to claim 1, characterized in that the method further comprises:
When the length of a keyword in the (n+1)-gram result keyword set exceeds the maximum word length, removing the first or the last unigram of the keyword to obtain an optimized keyword;
Matching the optimized keyword against a restricted vocabulary and, if it matches, replacing the keyword in the (n+1)-gram result keyword set with the optimized keyword.
4. The method for automatic text keyword extraction according to claim 1, characterized in that the step of "obtaining an n-gram candidate keyword set" comprises:
Segmenting the text into an n-gram set, first filtering the noise in the set, and then extracting the keywords in the set to obtain the n-gram candidate keyword set.
5. The method for automatic text keyword extraction according to claim 4, characterized in that, when n is 2, that is, after the text is segmented into a bigram set, the step of filtering noise comprises:
After segmenting the text into a unigram set, first filtering the noise in that set, then extracting the keywords in that set and obtaining the highest word frequency max_count among those keywords;
Filtering out words in the bigram set whose word frequency is less than or equal to max_count/5;
Filtering out words in the bigram set whose word frequency is less than 2;
Each word in the bigram set consists of a front word and a rear word; with x being the smaller of the word frequencies of the front word and the rear word in the unigram set, filtering out words in the bigram set whose word frequency is less than 2x/3;
Filtering out words in the bigram set that do not conform to the rules.
6. The method for automatic text keyword extraction according to claim 1, characterized in that the step of "obtaining an n-gram candidate keyword set" comprises:
Obtaining an (n-1)-gram candidate keyword set;
Merging any two keywords in the (n-1)-gram candidate keyword set that contain the same (n-2)-gram at different positions within the keyword, to obtain the n-gram candidate keyword set, where n is a positive integer greater than 2.
7. The method for automatic text keyword extraction according to claim 1, characterized in that:
After the keywords in the n-gram candidate keyword set are merged, noise filtering is performed to obtain the (n+1)-gram result keyword set.
8. The method for automatic text keyword extraction according to claim 7, characterized in that the step of filtering noise after merging the keywords in the n-gram candidate keyword set comprises:
After segmenting the text into a unigram set, first filtering the noise in that set, then extracting the keywords in that set and obtaining the highest word frequency max_count among those keywords;
Filtering out merged n-gram candidate keywords whose word frequency is less than max_count/4;
Filtering out merged n-gram candidate keywords whose word frequency is less than 2.
9. An electronic device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method for automatic text keyword extraction according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for automatic text keyword extraction according to any one of claims 1 to 8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910754155.7A CN110532551A (en) | 2019-08-15 | 2019-08-15 | Method, device, and storage medium for automatic text keyword extraction |
PCT/CN2019/115115 WO2021027085A1 (en) | 2019-08-15 | 2019-11-01 | Method and device for automatically extracting text keyword, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910754155.7A CN110532551A (en) | 2019-08-15 | 2019-08-15 | Method, device, and storage medium for automatic text keyword extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110532551A (en) | 2019-12-03 |
Family
ID=68663358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910754155.7A Pending CN110532551A (en) | 2019-08-15 | 2019-08-15 | Method, equipment and the storage medium that text key word automatically extracts |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110532551A (en) |
WO (1) | WO2021027085A1 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1354432A (en) * | 2000-11-17 | 2002-06-19 | 意蓝科技股份有限公司 | Method for automatically-searching key word from file and its system |
CN102375863A (en) * | 2010-08-27 | 2012-03-14 | 北京四维图新科技股份有限公司 | Method and device for keyword extraction in geographic information field |
CN102411563A (en) * | 2010-09-26 | 2012-04-11 | 阿里巴巴集团控股有限公司 | Method, device and system for identifying target words |
CN103092979A (en) * | 2013-01-31 | 2013-05-08 | 中国科学院对地观测与数字地球科学中心 | Processing method and device for searching of natural language by remote sensing data |
CN103678318A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
CN104216875A (en) * | 2014-09-26 | 2014-12-17 | 中国科学院自动化研究所 | Automatic microblog text abstracting method based on unsupervised key bigram extraction |
CN105426539A (en) * | 2015-12-23 | 2016-03-23 | 成都电科心通捷信科技有限公司 | Dictionary-based lucene Chinese word segmentation method |
CN107665191A (en) * | 2017-10-19 | 2018-02-06 | 中国人民解放军陆军工程大学 | A kind of proprietary protocol message format estimating method based on expanded prefix tree |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2003256456A1 (en) * | 2002-07-03 | 2004-01-23 | Word Data Corp. | Text-representation, text-matching and text-classification code, system and method |
CN100520782C (en) * | 2007-11-09 | 2009-07-29 | 清华大学 | News keyword abstraction method based on word frequency and multi-component grammar |
CN106557459B (en) * | 2015-09-24 | 2019-12-27 | 北京神州泰岳软件股份有限公司 | Method and device for extracting new words from work order |
CN105956158B (en) * | 2016-05-17 | 2019-08-09 | 清华大学 | The method that network neologisms based on massive micro-blog text and user information automatically extract |
-
2019
- 2019-08-15 CN CN201910754155.7A patent/CN110532551A/en active Pending
- 2019-11-01 WO PCT/CN2019/115115 patent/WO2021027085A1/en active Application Filing
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488727A (en) * | 2020-03-24 | 2020-08-04 | 南阳柯丽尔科技有限公司 | Word file parsing method, word file parsing apparatus, and computer-readable storage medium |
CN111488727B (en) * | 2020-03-24 | 2023-09-19 | 南阳柯丽尔科技有限公司 | Word file parsing method, word file parsing apparatus, and computer-readable storage medium |
CN116978384A (en) * | 2023-09-25 | 2023-10-31 | 成都市青羊大数据有限责任公司 | Public security integrated big data management system |
CN116978384B (en) * | 2023-09-25 | 2024-01-02 | 成都市青羊大数据有限责任公司 | Public security integrated big data management system |
Also Published As
Publication number | Publication date |
---|---|
WO2021027085A1 (en) | 2021-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Verma et al. | | Tokenization and filtering process in RapidMiner |
CN102169495B (en) | | Industry dictionary generating method and device |
CN106021272B (en) | | The keyword extraction method calculated based on distributed expression term vector |
CN106294320B (en) | | A kind of terminology extraction method and system towards academic paper |
CN104598532A (en) | | Information processing method and device |
US20130339373A1 (en) | | Method and system of filtering and recommending documents |
Al-Taani et al. | | An extractive graph-based Arabic text summarization approach |
Mizumoto et al. | | Sentiment analysis of stock market news with semi-supervised learning |
Sabuna et al. | | Summarizing Indonesian text automatically by using sentence scoring and decision tree |
CN108170666A (en) | | A kind of improved method based on TF-IDF keyword extractions |
Pabitha et al. | | Automatic question generation system |
Prasad | | WING-NUS at CL-SciSumm 2017: Learning from Syntactic and Semantic Similarity for Citation Contextualization. |
CN110532551A (en) | | Method, equipment and the storage medium that text key word automatically extracts |
Desai et al. | | Automatic text summarization using supervised machine learning technique for Hindi langauge |
Sheeba et al. | | Improved keyword and keyphrase extraction from meeting transcripts |
KR20210062934A (en) | | Text document cluster and topic generation apparatus and method thereof |
Skorkovská | | Application of lemmatization and summarization methods in topic identification module for large scale language modeling data filtering |
Agathangelou et al. | | Mining domain-specific dictionaries of opinion words |
KR101120038B1 (en) | | Neologism selection apparatus and its method |
KR101290439B1 (en) | | Method for summerizing meeting minutes based on sentence network |
CN104166712A (en) | | Method and system for scientific and technical literature retrieval |
Juric et al. | | Discovering links between political debates and media |
Al-Thwaib | | Text Summarization as Feature Selection for Arabic Text Classification. |
Rostami et al. | | Proposing a method to classify texts using data mining |
CN115455975A (en) | | Method and device for extracting topic keywords based on multi-model fusion decision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20191203 |