CN106997344A - Keyword abstraction system - Google Patents


Info

Publication number
CN106997344A
Authority
CN
China
Prior art keywords
keyword
speech
word
module
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710211226.XA
Other languages
Chinese (zh)
Inventor
罗镇权
罗强
刘世林
练睿
闫俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Business Big Data Technology Co Ltd
Priority to CN201710211226.XA
Publication of CN106997344A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of natural language processing, and in particular to a keyword extraction system comprising a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight calculation module. The system restricts the parts of speech eligible for keyword extraction through part-of-speech screening, thereby steering the direction of extraction. Word vectors are trained on a large-scale corpus, and the weight of each candidate word in the document to be extracted is computed by combining the cosine distance between those word vectors with TF-IDF weights. Introducing an external corpus broadens the scope considered during keyword extraction, so that the extracted keywords are more reasonable, and provides a new tool for effective keyword extraction.

Description

Keyword abstraction system
Technical field
The present invention relates to the field of natural language processing, and in particular to a keyword extraction system.
Background art
With the rapid development of the Internet and the arrival of the big data era, a large share of the information people encounter in daily life exists in the form of electronic documents. Faced with this vast sea of information, people urgently need machines that can automatically identify the keywords that best represent the gist of an article, helping them grasp its main content faster and saving the time spent reading, processing, and using these electronic documents.
This technology is currently known as keyword extraction. Keyword extraction means quickly obtaining from a document several words or phrases that represent its topic, as a condensed summary of its main content. Through keywords, people can quickly understand the main content of a document and efficiently grasp its topic. Keywords are widely used in fields such as news reporting and technical papers, where they allow people to manage and retrieve documents efficiently. As information grows at an exponential rate, keywords have become the principal tool with which users retrieve content of interest from massive amounts of information; the search engines people use every day all operate on keywords. Changes in the usage frequency and underlying meaning of keywords across documents from different periods have also become an important avenue for studying the evolution of human society, economy, culture, and political ideas.
Among the prior art for keyword extraction, one approach is unsupervised: candidate keywords are ranked by a statistical property such as TF-IDF, and the top several are chosen as keywords. This approach relies purely on statistical properties and discovers the document topic only from information inside the document, namely the degree to which words cluster. Its shortcoming is that the information in a single document is limited and often insufficient for discovering the topic. In some documents, certain very important keywords occur with low frequency yet are essential to reflecting the topic of the article; statistical methods alone cannot extract such words. A new automatic keyword extraction tool is therefore needed to make the extraction results more reasonable.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art by providing a keyword extraction system that extracts keywords automatically. When extracting keywords, the system screens the segmented words of the document by part-of-speech information, which keeps the extraction on target and improves computational efficiency. On this basis, more external information is introduced to support keyword extraction for the document, broadening the scope considered so that the extraction results are more reasonable.
To achieve this object, the present invention provides the following technical solution:
A keyword extraction system comprising: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight calculation module;
The preprocessing module processes the first corpus, the second corpus, and the document to be extracted, including word segmentation, removal of high-frequency words, and removal of stop words;
The word vector conversion module converts the words of the first corpus, after processing by the preprocessing module, into vectors;
The part-of-speech tagging module tags the part of speech of each word in the document to be extracted after processing by the preprocessing module;
The part-of-speech screening module screens the tagged words of the document to be extracted and retains only words of the configured parts of speech as keyword candidates;
The candidate word weight calculation module computes the weight of each candidate word in the document and retains a configured number of words as the keywords of the document;
The candidate word weight calculation module computes the candidate word weight using the following formula:
$$\Pr(t \mid d) = \sum_{w \in d} \Pr(t \mid w)\,\Pr(w \mid d)$$
where Pr(t | d) is the weight of the current candidate word t in the document from which keywords are to be extracted; Pr(w | d) is the TF-IDF value of word w in document d; and Pr(t | w) is the cosine-distance term between the current candidate word and the other words in the document, which the summation accumulates over those words.
Further, the system also includes a first corpus storage module and a second corpus storage module; the number of documents in the first corpus is more than 10 times the number of documents in the second corpus.
Preferably, the word vector conversion module uses Word2vec to convert the words of the segmented first corpus into vectors.
Further, the part-of-speech screening module includes a part-of-speech setting module for choosing which parts of speech to retain. This setting module provides an interface through which the user selects the parts of speech of the keywords to be extracted when using the system. The options may be presented to the user as checkboxes.
Further, the default setting of the part-of-speech setting module retains nr (personal names), nrf (transliterated personal names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names). When the user does not set keyword parts of speech, the system retains words of these default parts of speech as keyword candidates.
Further, the candidate word weight calculation module includes a setting module for the number of keywords to output or a weight threshold; the user enters a value in the setting module to set the keyword count or threshold.
Further, this setting module has a default setting; when the user makes no selection, the weight calculation module outputs the corresponding keywords according to the default setting.
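The count-or-threshold setting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_keywords`, its parameters, and the fallback of 5 keywords are all assumptions, since the source does not state the default value.

```python
def select_keywords(weights, top_n=None, threshold=None, default_top_n=5):
    """Pick keywords from {word: weight} by count, by threshold, or by default.

    default_top_n is a hypothetical fallback; the source only says that a
    default setting exists, not what it is.
    """
    # Rank candidates from highest to lowest weight.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        # Keep only candidates whose weight reaches the threshold.
        ranked = [(word, score) for word, score in ranked if score >= threshold]
    if top_n is None and threshold is None:
        # The user set nothing, so fall back to the default count.
        top_n = default_top_n
    if top_n is not None:
        ranked = ranked[:top_n]
    return [word for word, _ in ranked]


example_weights = {"stocks": 0.5, "closure": 0.3, "today": 0.1}
by_count = select_keywords(example_weights, top_n=2)
by_threshold = select_keywords(example_weights, threshold=0.2)
by_default = select_keywords(example_weights)
```

Passing both `top_n` and `threshold` applies the threshold first and then truncates, which is one reasonable reading of "number or threshold" but not the only one.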
Further, the system is a computer, server, or mobile smart terminal loaded with the keyword extraction program.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention provides a keyword extraction system comprising a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight calculation module. Compared with existing keyword extraction techniques, the system screens the segmented candidate words by part-of-speech information, which keeps the extraction on target, makes it more specific to the direction of analysis, and improves computational efficiency. Moreover, the system selects a large-scale first corpus and uses Word2vec to convert its segmentation results into vectors. The resulting word vectors carry each word's semantic closeness to, and co-occurrence frequency with, other words in the large-scale corpus. When computing candidate word weights, the cosine distance between the candidate word and other words is taken into account, so the word vectors bring more external information into keyword extraction for the document. This broadens the scope considered and overcomes the poor extraction results of the prior art when the document to be extracted contains little information. The system thus provides a new tool for keyword extraction in natural language processing.
Brief description of the drawings
Fig. 1 is a structural diagram of the keyword extraction system.
Detailed description
The present invention is described in further detail below with reference to test examples and embodiments. This should not be understood as limiting the scope of the above subject matter to the following embodiments; all techniques realized on the basis of the content of the present invention fall within its scope.
The keyword extraction system provided here takes more factors into account than existing keyword extraction techniques. Candidate words are screened by part-of-speech information after segmentation, which keeps the extraction on target and improves computational efficiency. On this basis, more external information is introduced to support keyword extraction for the document, broadening the scope considered so that the extracted keywords are more reasonable.
To achieve this object, the present invention provides the following technical solution:
A keyword extraction system, as shown in Fig. 1, comprising: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight calculation module.
The preprocessing module processes the first corpus, the second corpus, and the document to be extracted, including word segmentation, removal of high-frequency words, and removal of stop words. Many word segmentation tools are available, such as the Stanford segmenter, LTP from Harbin Institute of Technology, NLPIR from the Institute of Computing Technology of the Chinese Academy of Sciences, THULAC from Tsinghua University, and jieba; many enterprises also have internally developed segmentation tools.
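The preprocessing steps can be sketched as below. This is a minimal sketch under stated assumptions: whitespace splitting stands in for a real Chinese segmenter, and the rule for what counts as a high-frequency word (appearing in more than `max_doc_ratio` of the documents) is hypothetical, because the source does not define it. The name `preprocess` is likewise illustrative.

```python
from collections import Counter


def preprocess(docs, stop_words, max_doc_ratio=0.8):
    """Segment each document, then drop stop words and high-frequency words."""
    # Assumption: whitespace splitting stands in for a real word segmenter.
    tokenized = [doc.split() for doc in docs]
    # Count the number of documents in which each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Hypothetical rule: a word is "high frequency" if it appears in more
    # than max_doc_ratio of all documents.
    limit = max_doc_ratio * len(docs)
    high_freq = {word for word, n in doc_freq.items() if n > limit}
    return [
        [w for w in tokens if w not in stop_words and w not in high_freq]
        for tokens in tokenized
    ]


docs = [
    "the market opened lower today",
    "the market is closed today",
    "the index fell",
]
cleaned = preprocess(docs, stop_words={"is"}, max_doc_ratio=0.8)
```

Here "the" appears in all three documents and is dropped as a high-frequency word, while "is" is dropped as a stop word.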
The word vector conversion module converts the words of the first corpus, after processing by the preprocessing module, into vectors. Preferably, the widely used word2vec can perform this conversion on the segmented corpus. The vectors word2vec produces reflect both the semantic closeness between words and their co-occurrence frequency: words that are close in meaning, or that frequently occur together in documents, are converted into vectors that lie close together in space.
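The idea that related words end up as nearby vectors can be illustrated with a cosine measure over toy vectors. The three-dimensional vectors below are invented for illustration and are not the output of any trained word2vec model; the function name `cosine_similarity` is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical word vectors standing in for vectors trained on the first corpus.
vectors = {
    "stocks": [0.9, 0.1, 0.0],
    "market": [0.8, 0.2, 0.1],
    "holiday": [0.0, 0.1, 0.9],
}

related = cosine_similarity(vectors["stocks"], vectors["market"])
unrelated = cosine_similarity(vectors["stocks"], vectors["holiday"])
```

With these toy values, "stocks" and "market" point in nearly the same direction, so their cosine is close to 1, while "stocks" and "holiday" are nearly orthogonal.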
The part-of-speech tagging module tags the part of speech of each word in the document to be extracted after processing by the preprocessing module. Many part-of-speech tagging tools are available; tagging the segmented words with such a tool prepares for part-of-speech-based candidate screening.
The part-of-speech screening module screens the words of the document to be extracted according to the tagging results and retains only words of the configured parts of speech as keyword candidates.
Further, the part-of-speech screening module includes a setting module for the parts of speech to retain, so that the user can set the keyword parts of speech according to the direction of analysis. The keywords one needs from a document may differ with the direction of analysis, and the keywords produced by existing extraction tools are not specific to any direction. The present system lets the user set the parts of speech of the keywords to be extracted according to the direction to be analyzed and then extracts keywords of those parts of speech, which is more targeted. Because candidates are screened by the configured parts of speech, later computation is carried out only on the retained words, which reduces the amount of computation and improves efficiency.
The part-of-speech setting module can be offered to the user as checkboxes in a web interface.
Further, the default setting of the part-of-speech setting module retains nr (personal names), nrf (transliterated personal names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names). When the user does not set keyword parts of speech, the system retains words of these default parts of speech as keyword candidates. The default setting makes the system more general: when no special configuration is needed, the system still outputs results according to its defaults.
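Part-of-speech screening with a user-overridable default can be sketched as follows. The default tag set is taken from the text above; the function name `filter_candidates`, the `(word, tag)` input format, the sample tags `t` and `u`, and the choice to drop repeated words are assumptions made for illustration.

```python
# Default parts of speech to retain, as listed in the description.
DEFAULT_POS = frozenset({"nr", "nrf", "nw", "nt", "a", "nz", "v", "n", "ns"})


def filter_candidates(tagged_words, keep_pos=None):
    """Return unique words whose tag is in keep_pos, in first-seen order."""
    # Fall back to the default tag set when the user has set nothing.
    keep = DEFAULT_POS if keep_pos is None else set(keep_pos)
    seen = set()
    candidates = []
    for word, pos in tagged_words:
        if pos in keep and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates


# Hypothetical tagger output: (word, part-of-speech tag) pairs.
tagged = [
    ("European stocks", "n"),
    ("today", "t"),
    ("open lower", "v"),
    ("France", "ns"),
    ("of", "u"),
    ("France", "ns"),
]

default_candidates = filter_candidates(tagged)
place_names_only = filter_candidates(tagged, keep_pos={"ns"})
```

Restricting `keep_pos` to place names narrows the candidate list to the direction of analysis the user cares about, which is the behavior the setting module is meant to provide.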
The system also includes a first corpus storage module and a second corpus storage module; the number of documents in the first corpus is more than 10 times the number in the second corpus. For example, the first corpus contains 100,000 documents and the second corpus contains 1,000. The first corpus is used to train word vectors and is introduced as external information, so the richer and more comprehensive its documents, the wider the scope of external information considered.
The candidate word weight calculation module computes the weight of each candidate word in the document and retains a configured number of words as the keywords of the document.
The weight of a candidate word is computed as follows:
$$\Pr(t \mid d) = \sum_{w \in d} \Pr(t \mid w)\,\Pr(w \mid d)$$
where Pr(t | d) is the weight of the current candidate word t in the document from which keywords are to be extracted. For each word t, Pr(w | d) is the weight of word w in document d (the words w are all words remaining after the document has been segmented and preprocessed by removing high-frequency words and stop words, not only the keyword candidates); normalized TF-IDF values can be used. Computing the TF-IDF value of each word requires choosing a second corpus. The documents in this corpus are chosen according to the document to be extracted, generally documents of a similar type: for example, if the document to be extracted is news, the second corpus is also drawn from news documents. By the principle of TF-IDF, choosing documents of a similar type better brings out the distinctiveness of the candidate words.
Specifically, TF is the number of occurrences of word w in document d divided by the total number of occurrences of all words in document d; IDF is the inverse document frequency, which in its standard form is:
$$\mathrm{IDF}(w) = \log \frac{|D|}{|\{d \in D : w \in d\}|}$$
Computing the TF-IDF value of word w in document d requires a second corpus D; |D| is the total number of documents in the second corpus, and |{d ∈ D : w ∈ d}| is the number of documents in the second corpus D that contain word w. TF-IDF is a mature technique in natural language processing for computing the weight of words in a document, and its details are not repeated here.
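The TF-IDF weighting described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the function name `normalized_tf_idf` is hypothetical, normalization by the sum of scores is one possible reading of "normalized TF-IDF", and treating a word absent from the second corpus as if it appeared in one document is an assumption added to avoid division by zero.

```python
import math
from collections import Counter


def normalized_tf_idf(doc_tokens, corpus_docs):
    """Return {word: weight} for one document, with weights summing to 1."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus_docs)
    corpus_sets = [set(tokens) for tokens in corpus_docs]
    scores = {}
    for word, count in counts.items():
        tf = count / total
        df = sum(1 for tokens in corpus_sets if word in tokens)
        # Assumption: a word missing from the corpus is treated as df = 1.
        idf = math.log(n_docs / max(df, 1))
        scores[word] = tf * idf
    norm = sum(scores.values())
    if norm == 0:
        return {word: 0.0 for word in scores}
    return {word: score / norm for word, score in scores.items()}


# The document to be extracted and a small stand-in for the second corpus.
doc = ["stocks", "stocks", "market", "closed"]
second_corpus = [
    ["market", "closed"],
    ["market", "open"],
    ["weather", "report"],
    ["market", "rally"],
]
weights = normalized_tf_idf(doc, second_corpus)
```

In this toy corpus "market" appears in three of four documents, so it receives a low weight even though it occurs in the document, while "stocks" is both frequent in the document and rare in the corpus.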
Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words (the words remaining after the document to be extracted has been segmented and preprocessed by removing high-frequency words and stop words, excluding the current word). Cosine distance, also called cosine similarity, uses the cosine of the angle between two vectors in a vector space as a measure of how much two individuals differ. The segmented first corpus is converted to vectors with word2vec. The word vectors trained this way are global with respect to the first corpus: each word corresponds to one unique vector, and that vector encodes how close the word is in meaning to other words and how often they occur together. For example, if word A co-occurs frequently with word B in the first corpus, the vector for A and the vector for B lie closer together in space and their cosine distance is larger. This is the basis for computing candidate word weights from cosine distance. Using the cosine distance between the current word and other words as a factor in the candidate weight neatly brings more external information into consideration. When a word occurs very rarely in the document to be extracted, the prior art cannot extract it as a keyword. The present system introduces the first corpus: if the word has a high co-occurrence frequency with other words in the first corpus, its cosine distance to those words is also large, which compensates for the weakness of computing keyword weights from TF-IDF alone and makes the weight formula more reasonable.
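The weight formula Pr(t | d) = Σ Pr(t | w) Pr(w | d) can be sketched as follows. This is one possible reading, not the patented implementation: Pr(t | w) is taken here as the raw cosine between the vectors of t and w, the current word is excluded from the sum, and words without a vector are skipped. The function names, the toy vectors, and the toy document weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_weights(candidates, doc_weights, vectors):
    """Score each candidate t as the sum over w of cosine(t, w) * Pr(w | d)."""
    scores = {}
    for t in candidates:
        total = 0.0
        for w, pr_w_d in doc_weights.items():
            # Skip the candidate itself and any word lacking a vector.
            if w == t or t not in vectors or w not in vectors:
                continue
            total += cosine(vectors[t], vectors[w]) * pr_w_d
        scores[t] = total
    return scores


# Hypothetical vectors (first corpus) and TF-IDF weights (document d).
vectors = {"stocks": [1.0, 0.0], "lower": [0.8, 0.6], "holiday": [0.0, 1.0]}
doc_weights = {"stocks": 0.2, "lower": 0.3, "holiday": 0.5}
scores = candidate_weights(["stocks", "holiday"], doc_weights, vectors)
```

In this toy example "stocks" has the lowest TF-IDF weight in the document but still outscores "holiday", because its vector lies closer to the other words; this is the effect the description attributes to the external corpus.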
Further, the system is a computer, server, or mobile smart terminal loaded with the keyword extraction program.
Embodiment 1
The following text is input into the keyword extraction system of the present invention and keywords are extracted: "European stocks open lower; overview of Christmas market closures in major European countries. China Securities Net: European stocks opened lower today. The Stoxx 600 index fell 0.1% to 366.12 points; France's CAC index fell 0.1% to 4669.76 points; Britain's FTSE 100 index fell 0.2% to 6253.29 points; the German stock market is closed today. Christmas closure arrangements differ among the major European countries. According to a financial news report, the UK stock market closes for three and a half days for the Christmas holiday: it closes half a day early on the 24th, is closed all day on the 25th (Christmas Day), and because the 26th (Boxing Day) falls on a Saturday, a substitute holiday is taken on the 28th, so UK stocks resume trading only on the 29th (Tuesday). The French stock market closes for one day for the Christmas holiday, December 25. The German stock market closes for two days, December 24 to 25."
After word segmentation and part-of-speech tagging, the retained parts of speech are set to nr (personal names), nrf (transliterated personal names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names). The keywords extracted by the present system are {open lower || European stocks || Christmas closure || Europe || country}, compared with the keywords extracted by TextRank: {market closure || Christmas || stock market || today || France}. The keywords extracted by the present system better capture the subject, with words such as "European stocks" and "open lower". "European stocks" occurs only once in the document, a low frequency, yet the whole text revolves around European stocks opening lower, so the word matters a great deal for reflecting the topic of the document. Existing techniques generally cannot extract keywords of this kind, whereas the present system extracts them well, and the extraction results are more reasonable.
Embodiment 2
The following text is input into the keyword extraction system of the present invention and keywords are extracted: "XX Co. is suspected of violating the Securities Law and is investigated by the CSRC. XX Co. (600***) announced on the evening of the 25th that on the 25th the company received an Investigation Notice from the China Securities Regulatory Commission (CSRC). Because the company is suspected of violating the Securities Law, the CSRC has decided, under the relevant provisions of the Securities Law, to open an investigation into the company." The keyword extraction result is {XX Co. || Securities Law || investigation || CSRC || company}, whereas the keywords extracted by the existing TextRank technique are {CSRC || company || investigation || Securities Law || violate}. The present system extracts the topic word "XX Co." and, relative to the TextRank result, better reflects the topic of the document. It is worth noting that once the parts of speech to retain and the other parameters are set, the present system is an unsupervised learning process with high extraction efficiency. Its extraction results still fall somewhat short of supervised learning and manual extraction, but this does not detract from the technical progress of the system over existing unsupervised keyword extraction techniques.

Claims (9)

1. A keyword extraction system, characterized in that the system comprises: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight calculation module;
The preprocessing module processes the first corpus, the second corpus, and the document to be extracted, including word segmentation, removal of high-frequency words, and removal of stop words;
The word vector conversion module converts the words of the first corpus, after processing by the preprocessing module, into vectors;
The part-of-speech tagging module tags the part of speech of each word in the document to be extracted after processing by the preprocessing module;
The part-of-speech screening module screens the tagged words of the document to be extracted and retains only words of the configured parts of speech as keyword candidates;
The candidate word weight calculation module computes the weight of each candidate word in the document and retains a configured number of words as the keywords of the document.
2. The system as claimed in claim 1, characterized in that the candidate word weight calculation module computes the candidate word weight using the following formula:
$$\Pr(t \mid d) = \sum_{w \in d} \Pr(t \mid w)\,\Pr(w \mid d)$$
where Pr(t | d) is the weight of the current candidate word t in the document from which keywords are to be extracted; Pr(w | d) is the TF-IDF value of word w in document d; and Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words in the document.
3. The system as claimed in claim 2, characterized in that the system also includes a first corpus storage module and a second corpus storage module, and the number of documents in the first corpus is more than 10 times the number of documents in the second corpus.
4. The system as claimed in claim 3, characterized in that the word vector conversion module uses Word2vec to convert the words of the segmented first corpus into vectors.
5. The system as claimed in claim 4, characterized in that the part-of-speech screening module includes a part-of-speech setting module for the keyword parts of speech to retain; when using the system, the user selects the keyword parts of speech to be extracted through the part-of-speech setting module.
6. The system as claimed in claim 5, characterized in that the default setting of the part-of-speech setting module retains nr (personal names), nrf (transliterated personal names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names).
7. The system as claimed in claim 6, characterized in that the candidate word weight calculation module includes a setting module for the number of keywords to output or a threshold; the user enters a value in the setting module to set the keyword count or threshold.
8. The system as claimed in claim 7, characterized in that the setting module has a default setting; when the user has not set the keyword number or threshold, the weight calculation module outputs the corresponding keywords according to the default setting.
9. The system as claimed in claim 8, characterized in that the system is a computer, server, or mobile smart terminal loaded with the function program described in any one of claims 1 to 7.
CN201710211226.XA 2017-03-31 2017-03-31 Keyword abstraction system Pending CN106997344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710211226.XA CN106997344A (en) 2017-03-31 2017-03-31 Keyword abstraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710211226.XA CN106997344A (en) 2017-03-31 2017-03-31 Keyword abstraction system

Publications (1)

Publication Number Publication Date
CN106997344A true CN106997344A (en) 2017-08-01

Family

ID=59435732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710211226.XA Pending CN106997344A (en) 2017-03-31 2017-03-31 Keyword abstraction system

Country Status (1)

Country Link
CN (1) CN106997344A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec
CN108052630A (en) * 2017-12-19 2018-05-18 中山大学 It is a kind of that the method for expanding word is extracted based on Chinese education video
CN108376131A (en) * 2018-03-14 2018-08-07 中山大学 Keyword abstraction method based on seq2seq deep neural network models
CN108388597A (en) * 2018-02-01 2018-08-10 深圳市鹰硕技术有限公司 Conference summary generation method and device
CN108460018A (en) * 2018-02-28 2018-08-28 首都师范大学 A kind of Chinese chapter theme expression power analysis method based on syntax predicate cluster
CN108549625A (en) * 2018-02-28 2018-09-18 首都师范大学 A kind of Chinese chapter Behaviour theme analysis method based on syntax object cluster
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Based on theme and semantic dialogue language material keyword abstraction method
CN109800219A (en) * 2019-01-18 2019-05-24 广东小天才科技有限公司 A kind of method and apparatus of corpus cleaning
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN111046169A (en) * 2019-12-24 2020-04-21 东软集团股份有限公司 Method, device and equipment for extracting subject term and storage medium
CN112364624A (en) * 2020-11-04 2021-02-12 重庆邮电大学 Keyword extraction method based on deep learning language model fusion semantic features
CN113486155A (en) * 2021-07-28 2021-10-08 国际关系学院 Chinese naming method fusing fixed phrase information
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375842A (en) * 2010-08-20 2012-03-14 姚尹雄 Method for evaluating and extracting keyword set in whole field
CN103106275A (en) * 2013-02-08 2013-05-15 西北工业大学 Text classification character screening method based on character distribution information
CN104199846A (en) * 2014-08-08 2014-12-10 杭州电子科技大学 Comment subject term clustering method based on Wikipedia
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
CN105159932A (en) * 2015-08-07 2015-12-16 南车青岛四方机车车辆股份有限公司 Data retrieving and sorting system and method
CN105243152A (en) * 2015-10-26 2016-01-13 同济大学 Graph model-based automatic abstracting method
CN106021272A (en) * 2016-04-04 2016-10-12 上海大学 Keyword automatic extraction method based on distributed expression word vector calculation
CN106294320A (en) * 2016-08-04 2017-01-04 武汉数为科技有限公司 Terminology extraction method and system for scientific papers
CN106502994A (en) * 2016-11-29 2017-03-15 上海智臻智能网络科技股份有限公司 Method and apparatus for keyword extraction from text

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 Similar entity mining method based on Doc2vec
CN108052630A (en) * 2017-12-19 2018-05-18 中山大学 Method for extracting expansion words based on Chinese education videos
CN108052630B (en) * 2017-12-19 2020-12-08 中山大学 Method for extracting expansion words based on Chinese education videos
CN108388597A (en) * 2018-02-01 2018-08-10 深圳市鹰硕技术有限公司 Conference summary generation method and device
CN108460018B (en) * 2018-02-28 2020-11-06 首都师范大学 Chinese chapter theme expression analysis method based on syntactic predicate clustering
CN108549625A (en) * 2018-02-28 2018-09-18 首都师范大学 Chinese chapter expression theme analysis method based on syntactic object clustering
CN108460018A (en) * 2018-02-28 2018-08-28 首都师范大学 Chinese chapter theme expression analysis method based on syntactic predicate clustering
CN108549625B (en) * 2018-02-28 2020-11-17 首都师范大学 Chinese chapter expression theme analysis method based on syntactic object clustering
CN108376131A (en) * 2018-03-14 2018-08-07 中山大学 Keyword extraction method based on seq2seq deep neural network model
CN108595425A (en) * 2018-04-20 2018-09-28 昆明理工大学 Keyword extraction method for dialogue corpora based on topic and semantics
CN109800219A (en) * 2019-01-18 2019-05-24 广东小天才科技有限公司 Method and apparatus for corpus cleaning
CN110298033B (en) * 2019-05-29 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling training extraction system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling training extraction system
CN111046169A (en) * 2019-12-24 2020-04-21 东软集团股份有限公司 Method, device and equipment for extracting subject term and storage medium
CN111046169B (en) * 2019-12-24 2024-03-26 东软集团股份有限公司 Method, device, equipment and storage medium for extracting subject term
CN112364624A (en) * 2020-11-04 2021-02-12 重庆邮电大学 Keyword extraction method based on deep learning language model fusion semantic features
CN112364624B (en) * 2020-11-04 2023-09-26 重庆邮电大学 Keyword extraction method based on deep learning language model fusion semantic features
CN113486155A (en) * 2021-07-28 2021-10-08 国际关系学院 Chinese naming method fusing fixed phrase information
CN113486155B (en) * 2021-07-28 2022-05-20 国际关系学院 Chinese naming method fusing fixed phrase information
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device

Similar Documents

Publication Publication Date Title
CN106997344A (en) Keyword abstraction system
Kwaik et al. Shami: A corpus of Levantine Arabic dialects
Schmitz Inducing ontology from Flickr tags
Hammond et al. Semantic enhancement engine: A modular document enhancement platform for semantic applications over heterogeneous content
Ahmed et al. Language identification from text using n-gram based cumulative frequency addition
Saravanan et al. Identification of rhetorical roles for segmentation and summarization of a legal judgment
CN106021272A (en) Keyword automatic extraction method based on distributed expression word vector calculation
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN112380848B (en) Text generation method, device, equipment and storage medium
Jayaram et al. A review: Information extraction techniques from research papers
Zhang et al. A Chinese question-answering system with question classification and answer clustering
Singh et al. Writing Style Change Detection on Multi-Author Documents.
CN106997345A (en) Keyword extraction method based on word vectors and word statistical information
Mollaei et al. Question classification in Persian language based on conditional random fields
Hakkani-Tur et al. Statistical sentence extraction for information distillation
Iacobelli et al. Finding new information via robust entity detection
Hamdi et al. Machine learning vs deterministic rule-based system for document stream segmentation
Hasan et al. Pattern-matching based for Arabic question answering: a challenge perspective
Kim et al. Question Answering Considering Semantic Categories and Co-Occurrence Density.
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
Goweder et al. Identifying broken plurals in unvowelised Arabic text
Eghbalzadeh et al. Persica: A Persian corpus for multi-purpose text mining and Natural language processing
Sirajzade et al. The LuNa Open Toolbox for the Luxembourgish Language
Kaur et al. Keyword extraction for Punjabi language
CN113590738A (en) Method for detecting network sensitive information based on content and emotion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170801
