CN106997344A - Keyword extraction system

Info

Publication number: CN106997344A
Application number: CN201710211226.XA
Authority: CN
Inventors: 罗镇权; 罗强; 刘世林; 练睿; 闫俊杰
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2017-03-31
Filing date: 2017-03-31
Publication date: 2017-08-01

Abstract

The present invention relates to the field of natural language processing, and in particular to a keyword extraction system comprising a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module and a candidate word weight calculation module. The system restricts the parts of speech eligible for keyword extraction through part-of-speech screening, which steers the direction of the extraction. Word vectors are trained on a large-scale corpus, and the weight of each candidate word in the target document is calculated by combining the cosine distance between these word vectors with TF-IDF weights. Introducing an external corpus broadens the scope considered during keyword extraction, so that the extraction results are more reasonable, and provides a new tool for effective keyword extraction.

Description

Keyword extraction system

Technical field

The present invention relates to the field of natural language processing, and in particular to a keyword extraction system.

Background art

With the rapid development of the Internet and the arrival of the big data era, a large share of the information people encounter in daily life exists in the form of electronic documents. Faced with this vast sea of information, people urgently need machines that can automatically identify the keywords that best represent the gist of an article, helping readers grasp its main content faster and saving the time spent reading, processing and using these electronic documents.

This technology is currently known as keyword extraction. Keyword extraction means quickly obtaining from a document several words or phrases that represent its subject, as a condensed summary of its main content. Keywords let people quickly understand the main content of a document and efficiently grasp its subject. They are widely used in fields such as news reporting and technical papers, where they allow documents to be managed and retrieved efficiently. As information grows at an exponential rate, keywords have become an important and primary means for users to retrieve content of interest from massive amounts of information; the search engines people use every day all operate on keywords. Changes in the usage frequency and inherent meaning of keywords across documents from different periods have also become an important way to study the development of human society, economy, culture and political ideas.

In the prior art of keyword extraction, one approach is unsupervised: candidate keywords are ranked by a statistical property such as TF-IDF, and the highest-ranked few are chosen as keywords. This approach relies purely on statistical properties and finds the document subject only from the information inside the document, that is, from the degree of aggregation of its words. Its drawback is that the information in a single document is limited and often insufficient for discovering the document subject. In some documents, individual keywords are very important for reflecting the subject of the article even though they occur infrequently, and statistical methods alone cannot extract such words. A new automatic keyword extraction tool is therefore needed to make keyword extraction results more reasonable.

Summary of the invention

The object of the present invention is to overcome the above deficiencies of the prior art by providing a keyword extraction system that extracts keywords automatically. When extracting keywords, the system screens the segmented words of the document by part-of-speech information, which keeps the extraction on the intended direction and improves computational efficiency. On this basis, more external information is introduced to support keyword extraction for the document, broadening the scope considered during extraction so that the results are more reasonable.

To achieve this object, the present invention provides the following technical solution:

A keyword extraction system comprising: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module and a candidate word weight calculation module.

The preprocessing module performs processing, including word segmentation, removal of high-frequency words and removal of stop words, on a first corpus, a second corpus and the document from which keywords are to be extracted (the target document).

The word vector conversion module converts the words of the first corpus, after processing by the preprocessing module, into vectors.

The part-of-speech tagging module performs part-of-speech tagging on the words of the target document after processing by the preprocessing module.

The part-of-speech screening module screens the words of the target document after part-of-speech tagging, retaining only words of the set parts of speech as keyword candidate words.

The candidate word weight calculation module calculates the weight of each candidate word in the document and retains a set number of words as the keywords of the document.

The candidate word weight calculation module calculates the candidate word weight using the following formula:

\Pr(t \mid d) = \sum_{w \in d} \Pr(t \mid w) \Pr(w \mid d)

where Pr(t | d) is the weight of the current candidate word t in the target document d; Pr(w | d) is the TF-IDF value of word w in document d; and Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words in the document.
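As a small worked illustration of this weighted sum, consider a document d whose remaining words, apart from the candidate t, are w_1 and w_2. The numbers below are hypothetical and chosen only to show the arithmetic, and Pr(t | w) is read here as the cosine value between the vectors of t and w, which is an interpretation rather than something the text states outright:

```latex
\Pr(w_1 \mid d) = 0.6, \quad \Pr(w_2 \mid d) = 0.4, \quad
\Pr(t \mid w_1) = 0.5, \quad \Pr(t \mid w_2) = 0.9

\Pr(t \mid d) = \Pr(t \mid w_1)\Pr(w_1 \mid d) + \Pr(t \mid w_2)\Pr(w_2 \mid d)
             = 0.5 \times 0.6 + 0.9 \times 0.4 = 0.66
```

A candidate that is close in vector space to the heavily weighted words of the document therefore receives a high weight even if its own frequency is low.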

Further, the system also includes a first corpus storage module and a second corpus storage module, and the number of documents in the first corpus is more than 10 times the number of documents in the second corpus.

Preferably, the word vector conversion module uses Word2vec to convert the words of the segmented first corpus into vectors.

Further, the part-of-speech screening module includes a part-of-speech setting module for setting which keyword parts of speech are retained. The part-of-speech setting module provides an interface through which the user sets the keyword parts of speech; when using the system, the user selects through this module the parts of speech of the keywords to be extracted. The part-of-speech setting module can be presented to the user as check boxes.

Further, by default the part-of-speech setting module retains nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns) and ns (place names). When the user has not set the keyword parts of speech, the system retains words of these parts of speech as keyword candidate words according to the default setting.

Further, the candidate word weight calculation module includes a setting module for the number of keywords to output or a weight threshold; the user enters a value in the setting module to set the number of keywords to extract or the threshold.

Further, the setting module has a default setting; when the user makes no selection, the weight calculation module outputs the corresponding keywords according to the default setting.
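The output setting described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `select_keywords`, its parameters and the default of five keywords are hypothetical and do not come from the text, which does not state the default value.

```python
def select_keywords(weights, top_k=None, threshold=None, default_k=5):
    """Pick keywords from a {word: weight} mapping.

    If a threshold is given, keep every word whose weight reaches it.
    Otherwise keep the top_k highest-weighted words, falling back to
    default_k (a hypothetical default) when the user set nothing.
    """
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        return [word for word, score in ranked if score >= threshold]
    k = default_k if top_k is None else top_k
    return [word for word, _ in ranked[:k]]
```

With weights `{"a": 0.9, "b": 0.5, "c": 0.1}`, both `top_k=2` and `threshold=0.5` keep `"a"` and `"b"`, while calling the function with no setting falls back to the default and returns all three words.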

Further, the system is a computer, server or mobile intelligent terminal loaded with the keyword extraction program.

Compared with the prior art, the beneficial effects of the present invention are as follows. The invention provides a keyword extraction system comprising a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module and a candidate word weight calculation module. Compared with existing keyword extraction techniques, keyword extraction based on this system screens the segmented candidate words by part-of-speech information, which keeps the extraction on the intended direction, makes it better targeted to the direction of analysis, and improves computational efficiency. Moreover, the system selects a large-scale first corpus and uses Word2vec to convert its segmentation results into vectors; the resulting word vectors carry each word's semantic similarity to, and co-occurrence frequency with, other words in the large-scale corpus. When the weight of a candidate word is calculated, its cosine distance to the other words is taken into account, so the word vectors bring more external information into keyword extraction for the target document. This broadens the scope considered during extraction and overcomes the poor extraction performance of the prior art when the target document carries little information. The invention thus provides a new tool for keyword extraction in natural language processing.

Brief description of the drawings

Fig. 1 is a structural diagram of the keyword extraction system.

Detailed description of the embodiments

The present invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all technology implemented on the basis of the content of the present invention falls within the scope of the invention.

The invention provides a keyword extraction system that takes more factors into account than existing keyword extraction techniques. It screens the segmented candidate words by part-of-speech information, which keeps the extraction on the intended direction and improves computational efficiency. On this basis, more external information is introduced to support keyword extraction for the document, broadening the scope considered during extraction so that the extracted keywords are more reasonable.

As shown in Fig. 1, the keyword extraction system comprises: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module and a candidate word weight calculation module.

The preprocessing module performs processing, including word segmentation, removal of high-frequency words and removal of stop words, on the first corpus, the second corpus and the target document. Many segmentation tools are currently available, such as the Stanford segmenter, LTP from Harbin Institute of Technology, NLPIR from the Institute of Computing Technology of the Chinese Academy of Sciences, THULAC from Tsinghua University, and jieba; many companies also have segmentation tools developed in-house.
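The two filtering steps of the preprocessing stage can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the documents have already been segmented into token lists by one of the tools above, and the helper names `high_frequency_words` and `preprocess`, as well as the `top_n` cut-off for what counts as "high frequency", are hypothetical.

```python
from collections import Counter


def high_frequency_words(corpus_tokens, top_n):
    """Return the top_n most frequent words across a segmented corpus.

    corpus_tokens is a list of documents, each a list of tokens.
    How many words count as "high frequency" is an assumption here.
    """
    counts = Counter(tok for doc in corpus_tokens for tok in doc)
    return {word for word, _ in counts.most_common(top_n)}


def preprocess(tokens, stop_words, high_freq):
    """Drop stop words and high-frequency words from one segmented document."""
    return [tok for tok in tokens if tok not in stop_words and tok not in high_freq]
```

For instance, with a corpus in which "the" is the single most frequent token and a stop list containing "of", `preprocess(["the", "stock", "of", "market"], {"of"}, high_freq)` keeps only `"stock"` and `"market"`.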

The word vector conversion module converts the words of the first corpus, after processing by the preprocessing module, into vectors. Preferably, the widely used word2vec can be used to convert the words of the segmented corpus into vectors. The vectors produced by word2vec reflect both the semantic similarity between words and their co-occurrence frequency: words that are close in meaning, or that co-occur frequently in documents, are converted into vectors that lie closer together in space.
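What "closer together in space" means can be made concrete with the cosine of the angle between two vectors. The sketch below is illustrative only: the two-dimensional toy vectors stand in for trained word vectors and are hypothetical, and the zero-vector guard is an added safety assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: "stock" and "market" point in similar directions,
# "holiday" points elsewhere.
toy = {"stock": [1.0, 0.2], "market": [0.9, 0.3], "holiday": [0.1, 1.0]}
closer = cosine_similarity(toy["stock"], toy["market"])
farther = cosine_similarity(toy["stock"], toy["holiday"])
```

Here `closer` is larger than `farther`, mirroring the statement that semantically related or frequently co-occurring words end up with a larger cosine value.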

The part-of-speech tagging module performs part-of-speech tagging on the words of the target document after processing by the preprocessing module. Many part-of-speech tagging tools are currently available; tagging the segmented words with such a tool prepares for the part-of-speech-based screening of candidate words.

The part-of-speech screening module screens the words of the target document after part-of-speech tagging, retaining only words of the set parts of speech as keyword candidate words. That is, the words of the target document are screened according to the tagging results, and only words of the set parts of speech are kept as keyword candidates.

Further, the part-of-speech screening module includes a setting module for the keyword parts of speech to retain, so that the user can set the keyword parts of speech according to the direction of analysis. The keywords needed from the same document may differ between analysis directions, and the keywords extracted by existing keyword extraction tools are not targeted to the direction of analysis. The system of the present invention lets the user set the parts of speech of the keywords to be extracted according to the direction to be analyzed, and then extracts keywords of those parts of speech, so it is better targeted to the analysis direction. In addition, because the system screens candidate words by the set parts of speech, later calculations are performed only on the retained words, which reduces the amount of computation and improves efficiency.

The part-of-speech setting module can be presented to the user as check boxes on a web page.

Further, by default the part-of-speech setting module retains nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns) and ns (place names). When the user has not set the keyword parts of speech, the system retains words of these parts of speech as keyword candidate words according to the default setting. Having a default setting makes the system more general: when the user needs no special settings, the system can still output results according to the default.
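The screening step with its default tag set can be sketched as below. The tag codes are the ones listed in the text; everything else is an assumption made for illustration: the function name `filter_candidates`, the `(word, tag)` pair format for tagger output, and the choice to return each candidate once in order of first appearance.

```python
# Default parts of speech to retain, as listed in the description.
DEFAULT_POS = frozenset({"nr", "nrf", "nw", "nt", "a", "nz", "v", "n", "ns"})


def filter_candidates(tagged_words, keep_pos=None):
    """Keep words whose tag is in keep_pos (or in the default set).

    tagged_words is a list of (word, tag) pairs from any POS tagger.
    Each surviving word is returned once, in order of first appearance.
    """
    keep = DEFAULT_POS if keep_pos is None else frozenset(keep_pos)
    seen = set()
    candidates = []
    for word, tag in tagged_words:
        if tag in keep and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates
```

Passing a narrower set, for example `keep_pos={"nr", "ns"}`, restricts the candidates to person and place names, which is how a user-chosen analysis direction would change the output in this sketch.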

The system also includes a first corpus storage module and a second corpus storage module, and the number of documents in the first corpus is more than 10 times the number of documents in the second corpus. For example, the first corpus may contain 100,000 documents and the second corpus 1,000 documents. The first corpus is used to train the word vectors that bring in external information, so the richer and more comprehensive the documents it contains, the wider the scope of external information considered.

The weight of a candidate word is calculated as follows:

\Pr(t \mid d) = \sum_{w \in d} \Pr(t \mid w) \Pr(w \mid d)

where Pr(t | d) is the weight of the current candidate word t in the target document d, computed for each candidate word t. Pr(w | d) is the weight of word w in document d, where w ranges over all words remaining after the target document has been segmented and preprocessed by removing high-frequency words and stop words, not only the keyword candidate words. Normalized TF-IDF values can be used for Pr(w | d). Calculating the TF-IDF value of each word requires the second corpus. The documents in this corpus are selected according to the target document, generally choosing documents of a similar type: if the target document is news, news documents are selected for the second corpus. By the principle of TF-IDF, selecting documents of a similar type better brings out the distinctiveness of the candidate words.

Specifically, TF is the number of occurrences of word w in document d divided by the total number of occurrences of all words in document d. IDF is the inverse document frequency:

\mathrm{IDF}(w) = \log \frac{|D|}{|\{d \in D : w \in d\}|}

Calculating the TF-IDF value of word w in document d requires the second corpus D, where |D| is the total number of documents in the second corpus and |{d ∈ D : w ∈ d}| is the number of documents in the second corpus D that contain word w. TF-IDF is a mature technique in natural language processing for calculating the weight of terms in a document, and its details are not repeated here.
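The TF and IDF definitions above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's implementation: the function name `normalized_tf_idf` is hypothetical, the normalization divides by the sum of all scores so that the weights add up to 1 (the text only says "normalized"), and a word absent from the second corpus is given a document count of 1 to avoid division by zero.

```python
import math
from collections import Counter


def normalized_tf_idf(doc_tokens, second_corpus):
    """Return {word: weight} for one preprocessed document.

    TF  = occurrences of w in the document / total tokens in the document.
    IDF = log(|D| / number of documents in D containing w).
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    doc_sets = [set(d) for d in second_corpus]
    n_docs = len(doc_sets)
    raw = {}
    for word, count in counts.items():
        df = sum(1 for s in doc_sets if word in s)
        # Assumption: clamp df to 1 so unseen words do not divide by zero.
        idf = math.log(n_docs / max(df, 1))
        raw[word] = (count / total) * idf
    norm = sum(raw.values())
    if norm == 0:
        return {word: 0.0 for word in raw}
    return {word: score / norm for word, score in raw.items()}
```

A frequent word that appears in many corpus documents and a rarer word that appears in few can end up with the same weight, which is the balancing effect TF-IDF is meant to provide.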

Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words, that is, the words other than the current word that remain after the target document has been segmented and preprocessed by removing high-frequency words and stop words. Cosine distance, also called cosine similarity, uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The word vectors obtained by applying word2vec to the segmentation results of the first corpus are global with respect to the first corpus: each word corresponds to one unique vector, and the vector reflects how close the word is in meaning to other words and how often it co-occurs with them. For example, if word A and word B co-occur very frequently in the first corpus, the word vector of A and the word vector of B lie closer together in space and their cosine value is larger. This is the basis for calculating candidate word weights from cosine distance. Taking the cosine distance between the current word and the other words into account in the weight calculation neatly brings in more external information as a reference. Some words occur so rarely in the target document that the prior art cannot extract them as keywords. Because the system of the present invention introduces the first corpus, if such a word co-occurs very frequently with other words in the first corpus, its cosine value with those words is also large. This compensates for the weakness of calculating keyword weights from TF-IDF alone and makes the weight calculation for keyword extraction more reasonable.
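Putting the pieces together, the candidate weight can be sketched as a cosine-weighted sum over the other words of the document. The sketch rests on an interpretation: Pr(t | w) is taken to be the cosine value between the vectors of t and w, and summing over w gives the "sum of cosine distances" weighted by Pr(w | d). The function names, the dictionary inputs and the choice to skip words without a vector are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def candidate_weights(candidates, word_weights, vectors):
    """Compute Pr(t|d) = sum over other words w of cos(t, w) * Pr(w|d).

    candidates   : candidate words that passed the part-of-speech screening
    word_weights : {w: Pr(w|d)}, e.g. normalized TF-IDF of every remaining word
    vectors      : {word: vector} trained on the first corpus
    """
    result = {}
    for t in candidates:
        total = 0.0
        for w, p_w in word_weights.items():
            # The current word itself is excluded; words without a vector are skipped.
            if w == t or t not in vectors or w not in vectors:
                continue
            total += cosine(vectors[t], vectors[w]) * p_w
        result[t] = total
    return result
```

In this sketch a candidate with a low TF-IDF value of its own can still score highly if its vector lies close to the heavily weighted words of the document, which is the effect the paragraph above describes.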

Embodiment 1

The following text is input into the keyword extraction system of the present invention for keyword extraction: "European stocks open lower; overview of Christmas closure arrangements in major European countries. China Securities Net reports: European stocks opened lower today. The Stoxx 600 index fell 0.1% to 366.12 points; the French CAC index fell 0.1% to 4669.76 points; the UK FTSE 100 index fell 0.2% to 6253.29 points; the German stock market is closed today. Christmas closure arrangements differ among the major European countries. According to a financial news report, the UK stock market closes for three and a half days for the Christmas holiday: it closes half a day early on the 24th and is closed all day on the 25th (Christmas Day); the 26th (Boxing Day) falls on a Saturday, so a substitute holiday is taken on the 28th, and UK stocks resume trading only on the 29th (Tuesday). The French stock market closes for one day for the Christmas holiday, on December 25. The German stock market closes for two days, December 24 and 25."

After word segmentation and part-of-speech tagging, the parts of speech to retain are set to nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns) and ns (place names). The keywords extracted by the system of the present invention are {open lower || European stocks || Christmas closure || Europe || country}, compared with {closure || Christmas || stock market || today || France} extracted by TextRank. The keywords extracted by the present system better capture subject words such as "European stocks" and "open lower". "European stocks" occurs in the document only once, a low frequency, yet the whole text revolves around European stocks opening lower, so the word is very important for reflecting the subject of the document. Existing techniques generally cannot extract keywords of this kind, whereas the present system extracts them well, making the extraction results more reasonable.

Embodiment 2

The following text is input into the keyword extraction system of the present invention for keyword extraction: "So-and-so Shares suspected of violating the Securities Law, placed under investigation by the CSRC. So-and-so Shares (600***) announced on the evening of the 25th that the company received an Investigation Notice from the China Securities Regulatory Commission (CSRC) on the 25th. Because the company is suspected of violating securities laws and regulations, the CSRC has decided, in accordance with the relevant provisions of the Securities Law, to open an investigation into the company." The keyword extraction result is {So-and-so Shares || Securities Law || investigation || CSRC || company}, whereas the keywords extracted by the existing TextRank technique are {CSRC || company || investigation || Securities Law || violate}. The system of the present invention successfully extracts the subject word "So-and-so Shares", and its result reflects the subject of the document better than the result of the TextRank method. It is worth noting that once the parts of speech to retain and the other parameters have been set, the system of the present invention operates as an unsupervised learning process with high extraction efficiency. Its results still fall somewhat short of supervised learning and manual extraction, but this does not detract from its technical progress over existing unsupervised keyword extraction techniques.

Claims

1. A keyword extraction system, characterized in that the system comprises: a preprocessing module, a word vector conversion module, a part-of-speech tagging module, a part-of-speech screening module and a candidate word weight calculation module;

the preprocessing module performs processing, including word segmentation, removal of high-frequency words and removal of stop words, on a first corpus, a second corpus and the document from which keywords are to be extracted (the target document);

the word vector conversion module converts the words of the first corpus, after processing by the preprocessing module, into vectors;

the part-of-speech tagging module performs part-of-speech tagging on the words of the target document after processing by the preprocessing module;

the part-of-speech screening module screens the words of the target document after part-of-speech tagging, retaining only words of the set parts of speech as keyword candidate words;

the candidate word weight calculation module calculates the weight of each candidate word in the document and retains a set number of words as the keywords of the document.

2. The system as claimed in claim 1, characterized in that the candidate word weight calculation module calculates the candidate word weight using the following formula:

\Pr(t \mid d) = \sum_{w \in d} \Pr(t \mid w) \Pr(w \mid d)

where Pr(t | d) is the weight of the current candidate word t in the target document d; Pr(w | d) is the TF-IDF value of word w in document d; and Pr(t | w) is the sum of the cosine distances between the current candidate word and the other words in the document.

3. The system as claimed in claim 2, characterized in that the system also includes a first corpus storage module and a second corpus storage module, and the number of documents in the first corpus is more than 10 times the number of documents in the second corpus.

4. The system as claimed in claim 3, characterized in that the word vector conversion module uses Word2vec to convert the words of the segmented first corpus into vectors.

5. The system as claimed in claim 4, characterized in that the part-of-speech screening module includes a part-of-speech setting module for the keyword parts of speech to retain; when the system is in use, the user selects through the part-of-speech setting module the keyword parts of speech to be extracted.

6. The system as claimed in claim 5, characterized in that by default the part-of-speech setting module retains nr (person names), nrf (transliterated person names), nw (new words), nt (organization names), a (adjectives), nz (other proper nouns), v (verbs), n (nouns) and ns (place names).

7. The system as claimed in claim 6, characterized in that the candidate word weight calculation module includes a setting module for the number of keywords to output or a weight threshold; the user enters a value in the setting module to set the number of keywords to extract or the threshold.

8. The system as claimed in claim 7, characterized in that the setting module has a default setting; when the user has not set the number of keywords or the threshold, the weight calculation module outputs the corresponding keywords according to the default setting.

9. The system as claimed in claim 8, characterized in that the system is a computer, server or mobile intelligent terminal loaded with the function program described in any one of claims 1 to 7.