CN106997344A - Keyword abstraction system - Google Patents
- Publication number
- CN106997344A
- Authority
- CN
- China
- Prior art keywords
- keyword
- speech
- word
- module
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of natural language processing, and more particularly to a keyword extraction system comprising a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight computation module. The system restricts the parts of speech eligible for keyword extraction through part-of-speech screening, which steers the direction of the extraction. It trains word vectors on a large-scale corpus and, relying on those vectors, computes the weight of each candidate word in the document to be extracted by combining cosine distance with TF-IDF weights. Introducing an external corpus widens the scope considered during keyword extraction, makes the extraction results more reasonable, and provides a new tool for effective keyword extraction.
Description
Technical field
The present invention relates to the field of natural language processing, and more particularly to a keyword extraction system.
Background art
With the rapid development of the Internet and the arrival of the big data era, much of the information people encounter in daily life exists in the form of electronic documents. Faced with this vast amount of information, people urgently need machines that can automatically identify the keywords that best represent the gist of an article, helping them grasp its main content faster and saving the time spent reading, processing, and using these electronic documents.
This technique is currently called keyword extraction: quickly obtaining from a document several words or phrases that represent its subject, as a condensed summary of its main content. Keywords let people quickly understand the main content of a document and grasp its subject efficiently. They are widely used in fields such as news reporting and technical papers, where they allow documents to be managed and retrieved efficiently. As information grows at an exponential rate, keywords have become an important and primary tool for users to find content of interest within massive amounts of information; the search engines people use every day all operate on keywords. Changes in the usage frequency and inherent meaning of keywords in documents from different periods have also become an important way to study the evolution of human society, economy, culture, and political ideas.
Among existing keyword extraction techniques, one approach is unsupervised: candidate keywords are ranked by a statistical property such as TF-IDF, and the several highest-ranked ones are chosen as keywords. This approach relies purely on statistical properties and discovers the document subject only from the degree of aggregation of the information inside the document, that is, of its words. Its shortcoming is that the information in a single document is limited and often insufficient for discovering the document subject. In some documents, certain very important keywords occur with low frequency yet play a very important role in reflecting the subject of the article; statistical methods alone cannot extract such words. A new automatic keyword extraction tool is therefore needed to make the extraction results more reasonable.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art by providing a keyword extraction system that extracts keywords automatically. During keyword extraction, the system screens the segmented words of the document by part-of-speech information, which steers the direction of the extraction and improves computational efficiency. On this basis, it introduces more external information to support keyword extraction for the document, widening the scope considered and making the extraction results more reasonable.
To achieve this object, the present invention provides the following technical solution:
A keyword extraction system, comprising: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight computation module;
The preprocessing module processes the first corpus, the second corpus, and the document to be extracted, the processing including word segmentation, removal of high-frequency words, and removal of stop words;
The word vector conversion module converts the words in the first corpus, after processing by the preprocessing module, into vectors;
The part-of-speech tagging module tags the part of speech of the words in the document to be extracted after processing by the preprocessing module;
The part-of-speech screening module screens the words in the tagged document to be extracted and retains only words of the set parts of speech as keyword candidate words;
The candidate word weight computation module computes the weight of each candidate word in the document and retains a set number of words as the keywords of the document;
The candidate word weight computation module computes the candidate word weight with a formula in which Pr(t | d) is the weight of the current candidate word t in the document to be extracted; Pr(w | d) is the TF-IDF value of word w in document d; and Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words in the document.
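The formula itself does not survive in this text. One reading consistent with the three definitions above, offered as an assumption rather than as the patent's exact expression, is a TF-IDF-weighted sum of cosine distances taken over the words w of document d:

```latex
\Pr(t \mid d) \;=\; \sum_{w \in d,\; w \neq t} \Pr(t \mid w)\,\Pr(w \mid d)
```

Under this reading, a candidate word that occurs rarely in d can still receive a high weight if its vector lies close to words that carry high TF-IDF weight in d.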
Further, the system also includes a first corpus storage module and a second corpus storage module; the number of documents in the first corpus is more than 10 times the number of documents in the second corpus.
Preferably, the word vector conversion module uses Word2vec to convert the words of the segmented first corpus into vectors.
Further, the part-of-speech screening module includes a part-of-speech setting module for the keyword parts of speech to retain. The part-of-speech setting module provides an interface through which the user sets the keyword parts of speech; when using the system, the user selects through this module the keyword parts of speech to be extracted. The part-of-speech setting module can be presented to the user as check boxes.
Further, the default setting of the part-of-speech setting module retains nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names). When the user has not set the keyword parts of speech, the system retains words of the corresponding parts of speech as keyword candidate words according to the default setting.
Further, the candidate word weight computation module includes a setting module for the number of keywords to output or for a weight threshold; the user enters a value in the setting module to set the number of keywords or the threshold.
Further, the setting module has a default setting; when the user makes no selection, the weight computation module outputs the corresponding keywords according to the default setting.
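The count-or-threshold selection with a default fallback can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_keywords` and the default of five keywords are hypothetical, since the text states no default value.

```python
from typing import Dict, List, Optional

DEFAULT_TOP_K = 5  # hypothetical default; the text does not state a value


def select_keywords(
    weights: Dict[str, float],
    top_k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Return candidate words ranked by weight, cut by count or by threshold."""
    # Sort by descending weight; ties are broken alphabetically for stability.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    if threshold is not None:
        return [word for word, score in ranked if score >= threshold]
    k = DEFAULT_TOP_K if top_k is None else top_k
    return [word for word, _ in ranked[:k]]


weights = {"stocks": 0.9, "Europe": 0.4, "today": 0.1}
print(select_keywords(weights, top_k=2))        # ['stocks', 'Europe']
print(select_keywords(weights, threshold=0.3))  # ['stocks', 'Europe']
print(select_keywords(weights))                 # no user setting: uses DEFAULT_TOP_K
```

Giving the threshold priority over the count when both are supplied is a design choice of this sketch only.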
Further, the system is a computer, server, or mobile intelligent terminal loaded with a program implementing the keyword extraction function.
Compared with the prior art, the beneficial effects of the present invention are as follows. The present invention provides a keyword extraction system comprising a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight computation module. Compared with existing keyword extraction techniques, keyword extraction based on this system screens the segmented candidate words by part-of-speech information, which steers the direction of the extraction, targets the direction of analysis more closely, and improves computational efficiency. Moreover, the system selects a large-scale first corpus and uses Word2vec to convert its segmentation results into vectors. The resulting word vectors carry each word's semantic similarity to, and co-occurrence frequency with, the other words of the large-scale corpus. When the weight of a candidate word is computed, the cosine distance between the candidate word and the other words is taken into account, so the word vectors bring more external information into keyword extraction for the document to be extracted. This widens the scope considered during extraction and overcomes the poor extraction performance of the prior art when the document to be extracted contains little information. The invention thus provides a new tool for keyword extraction in natural language processing.
Brief description of the drawings:
Fig. 1 is a structural diagram of the keyword extraction system.
Detailed description
The present invention is described in further detail below with reference to test examples and embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the present invention fall within the scope of the invention.
A keyword extraction system is provided that, compared with existing keyword extraction techniques, takes more factors into account. It screens the segmented candidate words by part-of-speech information, which steers the direction of the extraction and improves computational efficiency. On this basis, it introduces more external information to support keyword extraction for the document, widening the scope considered and making the extracted keywords more reasonable.
To achieve this object, the present invention provides the following technical solution:
A keyword extraction system which, as shown in Fig. 1, includes: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight computation module.
The preprocessing module processes the first corpus, the second corpus, and the document to be extracted, the processing including word segmentation, removal of high-frequency words, and removal of stop words. Many word segmentation tools are currently available, such as the Stanford word segmenter, LTP from Harbin Institute of Technology, NLPIR from the Institute of Computing Technology of the Chinese Academy of Sciences, THULAC from Tsinghua University, and jieba; many enterprises also have segmentation tools developed in-house.
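The two filtering steps of preprocessing can be sketched as below. The sketch assumes the text has already been segmented into tokens by one of the tools named above; the stop-word list, the 80% document-ratio cutoff for "high-frequency" words, and all function names are hypothetical, since the text does not define them.

```python
from collections import Counter
from typing import Iterable, List, Set

# Hypothetical stop-word list and cutoff; real systems load much larger lists.
STOP_WORDS: Set[str] = {"the", "of", "to", "a"}
MAX_DOC_RATIO = 0.8  # assumed: words in more than 80% of documents are "high-frequency"


def high_frequency_words(corpus: Iterable[List[str]], max_ratio: float = MAX_DOC_RATIO) -> Set[str]:
    """Words that appear in more than max_ratio of the corpus documents."""
    docs = [set(doc) for doc in corpus]
    doc_freq = Counter(word for doc in docs for word in doc)
    return {word for word, n in doc_freq.items() if n / len(docs) > max_ratio}


def preprocess(tokens: List[str], high_freq: Set[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop stop words and high-frequency words from an already segmented document."""
    return [t for t in tokens if t not in stop_words and t not in high_freq]


corpus = [
    ["stocks", "said", "fell", "the"],
    ["market", "said", "closed"],
    ["index", "said", "rose", "of"],
]
high_freq = high_frequency_words(corpus)  # only "said" occurs in every document
print(preprocess(["the", "index", "said", "fell"], high_freq))  # ['index', 'fell']
```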
The word vector conversion module converts the words in the first corpus, after processing by the preprocessing module, into vectors. Preferably, the widely used word2vec can be used to convert the words of the segmented corpus into vectors. The vectors produced by word2vec reflect both the semantic similarity between words and their co-occurrence frequency: words that are close in meaning, or that frequently co-occur in documents, are converted into vectors that lie closer together in space.
The part-of-speech tagging module tags the part of speech of the words in the document to be extracted after processing by the preprocessing module. Many part-of-speech tagging tools are currently available; tagging the segmented words with such a tool prepares for part-of-speech-based candidate word screening.
The part-of-speech screening module screens the words in the tagged document to be extracted according to the tagging results and retains only words of the set parts of speech as keyword candidate words.
Further, the part-of-speech screening module includes a setting module for the keyword parts of speech to retain; in use, the user can set the keyword parts of speech according to the direction of analysis. Understanding one document from different analysis directions may call for different keywords, and the keywords extracted by existing tools do not target the direction of analysis. The present system lets the user set the parts of speech of the keywords to be extracted according to the direction to be analyzed and then extracts keywords of the corresponding parts of speech, so the extraction targets the direction of analysis more closely. Because the system screens candidate words by the set parts of speech, later computation covers only the retained words, which reduces the amount of computation and improves efficiency. The part-of-speech setting module can be presented to the user as check boxes in a web interface.
Further, the default setting of the part-of-speech setting module retains nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names). When the user has not set the keyword parts of speech, the system retains words of the corresponding parts of speech as keyword candidate words according to the default setting. The default setting increases the generality of the system: when the user needs no special setting, the system can still output results according to its default setting.
The system also includes a first corpus storage module and a second corpus storage module; the number of documents in the first corpus is more than 10 times the number in the second corpus. For example, the first corpus may contain 100,000 documents and the second corpus 1,000 documents. The first corpus is used to train the word vectors and is thereby introduced as external information, so the richer and more comprehensive the documents in the first corpus, the wider the scope of external information considered.
The candidate word weight computation module computes the weight of each candidate word in the document and retains a set number of words as the keywords of the document.
In the weight calculation formula for a candidate word, Pr(t | d) is the weight of the current candidate word t in the document to be extracted. Pr(w | d) is the weight of word w in document d, for which the normalized TF-IDF value can be used; here w ranges over all words remaining after the document to be extracted has been segmented and preprocessed (high-frequency words and stop words removed), not only the keyword candidate words. Computing the TF-IDF value of each word requires selecting a second corpus. The documents of this corpus are chosen according to the document to be extracted, generally documents of a similar type: if the document to be extracted is a news article, news documents are selected for the second corpus. By the principle of TF-IDF, choosing documents of a similar type better brings out the distinctiveness of the candidate words.
Specifically, TF is the number of occurrences of word w in document d divided by the total number of occurrences of all words in document d, and IDF is the inverse document frequency. Computing the TF-IDF value of word w in document d requires the second corpus D: |D| is the total number of documents in the second corpus, and |{d ∈ D : w ∈ d}| is the number of documents in D that contain word w. TF-IDF is a mature technique in natural language processing for computing the weight of terms in a document, and its details are not repeated here.
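A normalized TF-IDF along these lines can be sketched as follows. TF follows the definition above; for IDF the sketch uses log((1 + |D|) / (1 + |{d ∈ D : w ∈ d}|)), where the add-one smoothing is an assumption of this example (it avoids division by zero for words absent from the second corpus) and is not taken from the patent. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc: List[str], second_corpus: List[List[str]]) -> Dict[str, float]:
    """Normalized TF-IDF of every word in doc, with IDF taken from second_corpus."""
    counts = Counter(doc)
    total = len(doc)
    corpus_sets = [set(d) for d in second_corpus]
    n_docs = len(corpus_sets)
    raw = {}
    for word, count in counts.items():
        tf = count / total
        df = sum(1 for d in corpus_sets if word in d)
        # Add-one smoothing (an assumption) keeps IDF finite and non-negative.
        idf = math.log((1 + n_docs) / (1 + df))
        raw[word] = tf * idf
    norm = sum(raw.values())
    # Normalize so the weights of the document's words sum to 1.
    return {w: (s / norm if norm > 0 else 0.0) for w, s in raw.items()}


doc = ["stocks", "stocks", "market", "said"]
second_corpus = [["market", "said"], ["said", "rose"], ["said", "fell"]]
pr_w_d = tf_idf(doc, second_corpus)
print({w: round(s, 3) for w, s in pr_w_d.items()})
# {'stocks': 0.8, 'market': 0.2, 'said': 0.0}
```

A word such as "said" that appears in every document of the second corpus receives zero weight, which is why a second corpus of the same type as the target document sharpens the contrast between ordinary and distinctive words.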
Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words, that is, the words remaining after the document to be extracted has been segmented and preprocessed (high-frequency words and stop words removed), excluding the current word. Cosine distance, also called cosine similarity, uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The word vectors trained by word2vec on the segmented first corpus are global with respect to the vocabulary of the first corpus: each word corresponds to one unique vector, and that vector reflects the word's semantic closeness to, and co-occurrence frequency with, other words. For example, if word A and word B co-occur very frequently in the first corpus, the vector for A and the vector for B lie closer together in space and their cosine similarity is larger, which is the basis for computing candidate word weights from cosine distance. Using the cosine distance between the current word and the other words as a factor in the candidate word weight brings in more external information as a reference. When a word occurs so rarely in the document to be extracted that the prior art cannot extract it as a keyword, the first corpus introduced by the present system helps: if the word co-occurs very frequently with other words in the first corpus, its cosine similarity to those words is also large. This compensates for the deficiency of computing keyword weights from TF-IDF alone and makes the weight calculation for keyword extraction more reasonable.
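The effect described above can be illustrated with a small sketch that scores each candidate t as the sum, over the other words w of the document, of cosine(t, w) × Pr(w | d). That product-sum form is an assumed reading of the weight formula, and the two-dimensional vectors and TF-IDF values below are invented for illustration rather than trained or computed.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_weights(
    candidates: List[str],
    doc_weights: Dict[str, float],
    vectors: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Score each candidate t as the sum over other words w of cosine(t, w) * Pr(w|d)."""
    scores = {}
    for t in candidates:
        if t not in vectors:
            continue  # no vector learned from the first corpus: skipped in this sketch
        total = 0.0
        for w, pr_w_d in doc_weights.items():
            if w == t or w not in vectors:
                continue
            total += cosine(vectors[t], vectors[w]) * pr_w_d
        scores[t] = total
    return scores


# Toy vectors and TF-IDF weights, invented for illustration only.
vectors = {"stocks": [0.9, 0.1], "index": [0.8, 0.2], "market": [0.7, 0.3], "holiday": [0.1, 0.9]}
doc_weights = {"stocks": 0.1, "index": 0.4, "market": 0.4, "holiday": 0.1}
scores = candidate_weights(["stocks", "holiday"], doc_weights, vectors)
print(scores["stocks"] > scores["holiday"])  # True
```

Here "stocks" and "holiday" have the same low TF-IDF weight, yet "stocks" scores higher because its vector lies close to the heavily weighted words "index" and "market".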
Further, the system is a computer, server, or mobile intelligent terminal loaded with a program implementing the keyword extraction function.
Embodiment 1
The following text is input into the keyword extraction system of the present invention for keyword extraction: "European stocks open lower; overview of Christmas closure arrangements in major European countries. China Securities Net news: European stocks opened lower today. The STOXX 600 index fell 0.1% to 366.12 points; the French CAC index fell 0.1% to 4669.76 points; the UK FTSE 100 index fell 0.2% to 6253.29 points; the German stock market is closed today. Christmas closure arrangements differ among the major European countries. According to a financial news report, the UK stock market closes for three and a half days for the Christmas holiday: it closes half a day early on the 24th, is closed all day on the 25th (Christmas Day), and, because the 26th (Boxing Day) falls on a Saturday, takes a substitute holiday on the 28th, so UK stocks resume trading only on the 29th (Tuesday). The French stock market closes for one day for the Christmas holiday, on December 25. The German stock market closes for two days, December 24 and 25."
After word segmentation and part-of-speech tagging, the retained parts of speech are set to: nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names). The keywords extracted by the present system are {open lower || European stocks || Christmas closure || Europe || country}, whereas the keywords extracted by TextRank are {closure || Christmas || stock market || today || France}. The keywords extracted by the present system include words such as "European stocks" and "open lower" that better reflect the subject. "European stocks" occurs only once in the document, a low frequency, yet the whole text revolves around European stocks opening lower, so the word plays a very important role in reflecting the subject of the document. Existing techniques generally cannot extract keywords of this kind, whereas the present system extracts them well, making the keyword extraction results more reasonable.
Embodiment 2
The following text is input into the keyword extraction system of the present invention for keyword extraction: "XX Shares suspected of violating the Securities Law, investigated by the China Securities Regulatory Commission (CSRC). XX Shares (600***) announced on the evening of the 25th that the company received an Investigation Notice from the CSRC on the 25th. Because the company is suspected of violating the provisions of the Securities Law, the CSRC has decided, in accordance with the relevant provisions of the Securities Law, to open a formal investigation into the company." The keyword extraction result is {XX Shares || Securities Law || investigation || CSRC || company}, whereas the keywords extracted by the existing TextRank technique are {CSRC || company || investigation || Securities Law || violate}. The present system successfully extracts the subject term "XX Shares", and its result reflects the subject of the document better than the result of the TextRank method. It is worth noting that once the parts of speech to retain and the other required parameters are set, the present system is an unsupervised learning process with high extraction efficiency. Its extraction results still fall somewhat short of supervised learning and manual extraction, but this does not diminish the technical advance of the present system over existing unsupervised keyword extraction techniques.
Claims (9)
1. A keyword extraction system, characterized in that the system comprises: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module, and a candidate word weight computation module;
The preprocessing module processes the first corpus, the second corpus, and the document to be extracted, the processing including word segmentation, removal of high-frequency words, and removal of stop words;
The word vector conversion module converts the words in the first corpus, after processing by the preprocessing module, into vectors;
The part-of-speech tagging module tags the part of speech of the words in the document to be extracted after processing by the preprocessing module;
The part-of-speech screening module screens the words in the tagged document to be extracted and retains only words of the set parts of speech as keyword candidate words;
The candidate word weight computation module computes the weight of each candidate word in the document and retains a set number of words as the keywords of the document.
2. The system as claimed in claim 1, characterized in that the candidate word weight computation module computes the candidate word weight with a formula in which Pr(t | d) is the weight of the current candidate word t in the document to be extracted; Pr(w | d) is the TF-IDF value of word w in document d; and Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words in the document.
3. The system as claimed in claim 2, characterized in that the system also includes a first corpus storage module and a second corpus storage module, and the number of documents in the first corpus is more than 10 times the number of documents in the second corpus.
4. The system as claimed in claim 3, characterized in that the word vector conversion module uses Word2vec to convert the words of the segmented first corpus into vectors.
5. The system as claimed in claim 4, characterized in that the part-of-speech screening module includes a part-of-speech setting module for the keyword parts of speech to retain; when using the system, the user selects through the part-of-speech setting module the keyword parts of speech to be extracted.
6. The system as claimed in claim 5, characterized in that the default setting of the part-of-speech setting module retains nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns), and ns (place names).
7. The system as claimed in claim 6, characterized in that the candidate word weight computation module includes a setting module for the number of keywords to output or for a threshold; the user enters a value in the setting module to set the number of keywords or the threshold.
8. The system as claimed in claim 7, characterized in that the setting module has a default setting; when the user has not set the number of keywords or the threshold, the weight computation module outputs the corresponding keywords according to the default setting.
9. The system as claimed in claim 8, characterized in that the system is a computer, server, or mobile intelligent terminal loaded with a function program as described in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710211226.XA CN106997344A (en) | 2017-03-31 | 2017-03-31 | Keyword abstraction system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710211226.XA CN106997344A (en) | 2017-03-31 | 2017-03-31 | Keyword abstraction system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106997344A (en) | 2017-08-01 |
Family ID=59435732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710211226.XA Pending CN106997344A (en) | 2017-03-31 | 2017-03-31 | Keyword abstraction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106997344A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107832306A (en) * | 2017-11-28 | 2018-03-23 | 武汉大学 | A kind of similar entities method for digging based on Doc2vec |
CN108052630A (en) * | 2017-12-19 | 2018-05-18 | 中山大学 | It is a kind of that the method for expanding word is extracted based on Chinese education video |
CN108376131A (en) * | 2018-03-14 | 2018-08-07 | 中山大学 | Keyword abstraction method based on seq2seq deep neural network models |
CN108388597A (en) * | 2018-02-01 | 2018-08-10 | 深圳市鹰硕技术有限公司 | Conference summary generation method and device |
CN108460018A (en) * | 2018-02-28 | 2018-08-28 | 首都师范大学 | A kind of Chinese chapter theme expression power analysis method based on syntax predicate cluster |
CN108549625A (en) * | 2018-02-28 | 2018-09-18 | 首都师范大学 | A kind of Chinese chapter Behaviour theme analysis method based on syntax object cluster |
CN108595425A (en) * | 2018-04-20 | 2018-09-28 | 昆明理工大学 | Based on theme and semantic dialogue language material keyword abstraction method |
CN109800219A (en) * | 2019-01-18 | 2019-05-24 | 广东小天才科技有限公司 | A kind of method and apparatus of corpus cleaning |
CN110298033A (en) * | 2019-05-29 | 2019-10-01 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Keyword corpus labeling trains extracting tool |
CN111046169A (en) * | 2019-12-24 | 2020-04-21 | 东软集团股份有限公司 | Method, device and equipment for extracting subject term and storage medium |
CN112364624A (en) * | 2020-11-04 | 2021-02-12 | 重庆邮电大学 | Keyword extraction method based on deep learning language model fusion semantic features |
CN113486155A (en) * | 2021-07-28 | 2021-10-08 | 国际关系学院 | Chinese naming method fusing fixed phrase information |
CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102375842A (en) * | 2010-08-20 | 2012-03-14 | 姚尹雄 | Method for evaluating and extracting keyword set in whole field |
CN103106275A (en) * | 2013-02-08 | 2013-05-15 | 西北工业大学 | Text classification character screening method based on character distribution information |
CN104199846A (en) * | 2014-08-08 | 2014-12-10 | 杭州电子科技大学 | Comment subject term clustering method based on Wikipedia |
CN104933164A (en) * | 2015-06-26 | 2015-09-23 | 华南理工大学 | Method for extracting relations among named entities in Internet massive data and system thereof |
CN105159932A (en) * | 2015-08-07 | 2015-12-16 | 南车青岛四方机车车辆股份有限公司 | Data retrieving and sorting system and method |
CN105243152A (en) * | 2015-10-26 | 2016-01-13 | 同济大学 | Graph model-based automatic abstracting method |
CN106021272A (en) * | 2016-04-04 | 2016-10-12 | 上海大学 | Keyword automatic extraction method based on distributed expression word vector calculation |
CN106294320A (en) * | 2016-08-04 | 2017-01-04 | 武汉数为科技有限公司 | A kind of terminology extraction method and system towards scientific paper |
CN106502994A (en) * | 2016-11-29 | 2017-03-15 | 上海智臻智能网络科技股份有限公司 | A kind of method and apparatus of the keyword extraction of text |
- 2017-03-31: CN application CN201710211226.XA (CN106997344A), status: Pending
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107832306A (en) * | 2017-11-28 | 2018-03-23 | 武汉大学 | Similar entity mining method based on Doc2vec |
CN108052630A (en) * | 2017-12-19 | 2018-05-18 | 中山大学 | Method for extracting expansion words based on Chinese education videos |
CN108052630B (en) * | 2017-12-19 | 2020-12-08 | 中山大学 | Method for extracting expansion words based on Chinese education videos |
CN108388597A (en) * | 2018-02-01 | 2018-08-10 | 深圳市鹰硕技术有限公司 | Conference summary generation method and device |
CN108460018B (en) * | 2018-02-28 | 2020-11-06 | 首都师范大学 | Chinese chapter theme expression analysis method based on syntactic predicate clustering |
CN108549625A (en) * | 2018-02-28 | 2018-09-18 | 首都师范大学 | Chinese chapter expression theme analysis method based on syntactic object clustering |
CN108460018A (en) * | 2018-02-28 | 2018-08-28 | 首都师范大学 | Chinese chapter theme expression analysis method based on syntactic predicate clustering |
CN108549625B (en) * | 2018-02-28 | 2020-11-17 | 首都师范大学 | Chinese chapter expression theme analysis method based on syntactic object clustering |
CN108376131A (en) * | 2018-03-14 | 2018-08-07 | 中山大学 | Keyword extraction method based on seq2seq deep neural network model |
CN108595425A (en) * | 2018-04-20 | 2018-09-28 | 昆明理工大学 | Keyword extraction method for dialogue corpora based on topic and semantics |
CN109800219A (en) * | 2019-01-18 | 2019-05-24 | 广东小天才科技有限公司 | Method and apparatus for corpus cleaning |
CN110298033B (en) * | 2019-05-29 | 2022-07-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Keyword corpus labeling training extraction system |
CN110298033A (en) * | 2019-05-29 | 2019-10-01 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Keyword corpus labeling training extraction tool |
CN111046169A (en) * | 2019-12-24 | 2020-04-21 | 东软集团股份有限公司 | Method, device and equipment for extracting subject term and storage medium |
CN111046169B (en) * | 2019-12-24 | 2024-03-26 | 东软集团股份有限公司 | Method, device, equipment and storage medium for extracting subject term |
CN112364624A (en) * | 2020-11-04 | 2021-02-12 | 重庆邮电大学 | Keyword extraction method based on deep learning language model fusion semantic features |
CN112364624B (en) * | 2020-11-04 | 2023-09-26 | 重庆邮电大学 | Keyword extraction method based on deep learning language model fusion semantic features |
CN113486155A (en) * | 2021-07-28 | 2021-10-08 | 国际关系学院 | Chinese naming method fusing fixed phrase information |
CN113486155B (en) * | 2021-07-28 | 2022-05-20 | 国际关系学院 | Chinese naming method fusing fixed phrase information |
CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
CN113743090B (en) * | 2021-09-08 | 2024-04-12 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106997344A (en) | Keyword abstraction system | |
Kwaik et al. | Shami: A corpus of levantine arabic dialects | |
Schmitz | Inducing ontology from flickr tags | |
Hammond et al. | Semantic enhancement engine: A modular document enhancement platform for semantic applications over heterogeneous content | |
Ahmed et al. | Language identification from text using n-gram based cumulative frequency addition | |
Saravanan et al. | Identification of rhetorical roles for segmentation and summarization of a legal judgment | |
CN106021272A (en) | Keyword automatic extraction method based on distributed expression word vector calculation | |
CN107180026B (en) | Event phrase learning method and device based on word embedding semantic mapping | |
CN112380848B (en) | Text generation method, device, equipment and storage medium | |
Jayaram et al. | A review: Information extraction techniques from research papers | |
Zhang et al. | A Chinese question-answering system with question classification and answer clustering | |
Singh et al. | Writing Style Change Detection on Multi-Author Documents. | |
CN106997345A (en) | Keyword extraction method based on word vectors and word statistical information | |
Mollaei et al. | Question classification in Persian language based on conditional random fields | |
Hakkani-Tur et al. | Statistical sentence extraction for information distillation | |
Iacobelli et al. | Finding new information via robust entity detection | |
Hamdi et al. | Machine learning vs deterministic rule-based system for document stream segmentation | |
Hasan et al. | Pattern-matching based for Arabic question answering: a challenge perspective | |
Kim et al. | Question Answering Considering Semantic Categories and Co-Occurrence Density. | |
CN107818078B (en) | Semantic association and matching method for Chinese natural language dialogue | |
Goweder et al. | Identifying broken plurals in unvowelised Arabic text | |
Eghbalzadeh et al. | Persica: A Persian corpus for multi-purpose text mining and Natural language processing | |
Sirajzade et al. | The LuNa Open Toolbox for the Luxembourgish Language | |
Kaur et al. | Keyword extraction for punjabi language | |
CN113590738A (en) | Method for detecting network sensitive information based on content and emotion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170801 |