CN110532390A

CN110532390A - News keyword extraction method based on NER and complex network features

Info

Publication number: CN110532390A
Application number: CN201910790303.0A
Authority: CN
Inventors: 纪明轩; 宋玉蓉
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-08-26
Filing date: 2019-08-26
Publication date: 2019-12-03
Anticipated expiration: 2039-08-26
Also published as: CN110532390B

Abstract

The invention discloses a news keyword extraction method based on NER and complex network features. It combines named entity recognition (NER) with the complex network (CN) characteristics of natural language and proposes a new keyword extraction method, NER-CN, based on NER combined with complex network features. The method first labels sentences and performs named entity recognition analysis, then constructs a complex network of the text and extracts keywords according to the global metric value of each node in the network. Compared with conventional methods, the proposed keyword extraction method significantly improves the precision, recall, and F1 score of text classification.

Description

A news keyword extraction method based on NER and complex network features

Technical field

The present invention studies the network characteristics of words in Chinese news texts, combines named entity recognition (NER) with the complex network (CN) characteristics of natural language, and proposes a new keyword extraction algorithm based on NER combined with complex network features (NER-CN). It belongs to the technical field of NLP.

Background art

In recent years, with the explosive growth of data and the increase in computing power, how users can quickly extract useful information from massive data has placed higher demands on the technology. Feature selection is an important step in text analysis, and its performance is particularly important to the classification result.

Traditional text feature extraction methods include TF-IDF, TextRank, LDA, and information gain. Because of the complexity of language itself, however, these methods tend to ignore the structural information of the text and to produce a large amount of redundant information. To retain the structural information of text, mapping the community and semantic structure of natural language onto complex networks has become an active research direction.

Keyword extraction is a foundation of natural language processing and has been studied in depth in recent years, both in China and abroad. The document: Amancio D R. Probing the Topological Properties of Complex Networks Modeling Short Written Texts [J]. PLoS One, 2014, 10(2): e0118394 proposes that short texts can be analysed with the methods and concepts of complex networks: sub-texts are extracted in a reasonable manner, a syntax-based word co-occurrence network is built, the complex network characteristics of dynamic short texts are analysed, and the superiority of the method is verified in a text classification experiment with the SVM algorithm. The document: De Arruda H F, Costa L D F, Amancio D R. Using complex networks for text classification: Discriminating informative and imaginative documents [J]. EPL (Europhysics Letters), 2016, 113(2): 28007 examines how to make effective use, in classification tasks, of features obtained from the Text REtrieval Conference (TREC); the authors carry out supervised classification aimed at distinguishing informative from imaginative documents, using a network model that describes the local topological and dynamical properties of function words, and substantially improve the accuracy of text classification. The document: Tang Jun. Application of complex networks in news web page keyword extraction [J]. Journal of Yunnan Institute for Nationalities (Natural Science Edition), 2012, 21(4) introduces the characteristics of weighted complex networks into keyword extraction, analyses the features of news web page documents and node weights, describes the weighted clustering coefficient and centrality of directed networks, and, building on the advantages of traditional algorithms, proposes an improved method for automatically extracting news keywords; experiments show that the algorithm is feasible. The document: Zhan Z J, Lin F, Yang X P. Keyword Extraction of Document Based on Weighted Complex Network [J]. Advanced Materials Research, 2011, 403-408: 2146-2151 studies the complex network characteristics of Chinese compositions and proposes an automatic keyword extraction algorithm for Chinese documents based on complex network features; drawing on the small-world structure of linguistic networks and theoretical results on complex networks, keywords are extracted from the feature values of word nodes in the document's language network, and the experimental results show that the algorithm has higher average precision than the traditional TF-IDF algorithm.

Although the above studies have advanced the application of complex networks to text to some extent, the following problem remains: news reports on specific topics often contain particular entities such as place names, person names, and dates, and traditional keyword extraction algorithms cannot extract this entity information effectively.

Summary of the invention

Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a news keyword extraction method based on NER and complex network features, and proposes a network model for representing the relationships between entities in a document. A small corpus is first constructed to complete text training, and a text complex network is built from the syntactic relations and the co-occurrence relations between words. The degree centrality and closeness centrality of nodes in the network are analysed together to create a formula for the global metric value of a network node. Because single-character words in the text interfere with keyword extraction, single-character words and common stop words are first removed from the network nodes, and the global metric value of each node is then calculated.

Technical solution: to achieve the above object, the present invention adopts the following technical solution:

A news keyword extraction method based on NER and complex network features, comprising the following steps:

Step 1: collect news content and generate a file list of the news content under the raw corpus folder; set a filtering threshold and filter out content whose character count is below the threshold; for each corpus item, match the url and the content with regular expressions; all news content is stored in separate txt files according to the category in the url;

Step 2: segment the content of each txt file with the jieba word segmenter and remove stop words;

Step 3: label sentences using a neural-network-based named entity recognition method, then perform date recognition, person name recognition, and place name recognition to extract the important named entities in the text;

Step 4: build the word complex network; digitally encode the words obtained in step 3 and use the encoded results as nodes; use 2 as the distance for word association, i.e. words within a distance of 2 of each other are connected by an edge; this judgment is made in a loop over every sentence;

Let V = {v_1, v_2, …, v_N} be a set of N nodes and let (v_i, v_j) denote the edge between nodes v_i ∈ V and v_j ∈ V; G(V, E) is the graph with V as its node set and E ⊆ V × V as its edge set. The degree centrality DC_i of node v_i is:

DC_i = k_i / (N − 1)    (1)

where k_i is the degree of node v_i, i.e. the number of nodes connected to it, and N is the number of nodes in the network;

The average distance from a node v_i to all nodes in the network, denoted d_i, is calculated as:

d_i = (1/N) · Σ_{j=1}^{N} d_ij    (2)

where d_ij is the distance from node v_i to node v_j. The average distance L between all nodes in the network is calculated with the following formula:

L = (1/N) · Σ_{i=1}^{N} d_i    (3)

The reciprocal of d_i is defined as the closeness centrality of node v_i, denoted CC_i:

CC_i = 1 / d_i = N / Σ_{j=1}^{N} d_ij    (4)

Closeness centrality and degree centrality are combined to obtain a new node importance index, the global metric value; the global metric value of node v_i is denoted CM_i:

CM_i = α·DC_i + β·CC_i    (5)

where DC_i is the degree centrality of node v_i, CC_i is the closeness centrality of node v_i, α is the adjustable parameter for degree centrality, β is the adjustable parameter for closeness centrality, and α + β = 1. Keywords are extracted according to the resulting global metric values.

Preferably, the filtering threshold in step 1 is 30.

Preferably, the neural-network-based named entity recognition method in step 3 is as follows: a BiLSTM-CRF model performs sequence labelling using character embeddings and word embeddings. Starting from the input layer, the layers of the model are, in order, the look-up layer, the bidirectional LSTM layer, the CRF layer, and the output layer. The look-up layer maps each character x_i in the sentence from a one-hot vector to a low-dimensional dense character vector, the bidirectional LSTM layer automatically extracts sentence features, and the CRF layer performs sentence-level sequence labelling. The parts tagged "ns" in a part-of-speech-tagged data set are used to build the place name recognition corpus, and named entities are recognised with the BiLSTM-CRF model.

Preferably, the adjustable parameter for degree centrality is α = 0.4 and the adjustable parameter for closeness centrality is β = 0.6.

Compared with the prior art, the present invention has the following advantages:

The keyword extraction of the present invention achieves considerable results in Chinese news text classification. Compared with the TF-IDF algorithm and the traditional text complex network construction algorithm, the keywords extracted by the present invention significantly improve the precision, recall, and F1 score of the classification results.

Brief description of the drawings

Fig. 1 is a flow chart of the optimized construction of the linguistic network.

Fig. 2 is a flow chart of the keyword extraction algorithm.

Detailed description of the embodiments

The present invention is further illustrated below with reference to the drawings and specific embodiments. It should be understood that these examples only illustrate the invention and do not limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims of this application.

A news keyword extraction method based on NER and complex network features studies the network characteristics of words in Chinese news texts, combines named entity recognition (NER) with the complex network (CN) characteristics of natural language, and proposes a new keyword extraction algorithm based on NER combined with complex network features (NER-CN). The algorithm first labels sentences and performs named entity recognition analysis, then constructs a complex network of the text and extracts keywords according to the global metric value of each node in the network. Experimental results show that the proposed keyword extraction algorithm significantly improves the precision, recall, and F1 score of text classification. The experiments are coded mainly in python3, and the method comprises the following steps:

Step 1: collect news content and generate a file list of the news content under the raw corpus folder. Set the threshold to 30; all content with fewer characters than the threshold is filtered out. For each corpus item, match the url and the content with regular expressions. All news content is stored in separate txt files according to the category in the url.
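The filtering and matching in Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `<url>`/`<content>` record layout, the rule that the first label of the host name is the category, and the names `RECORD_RE`, `CATEGORY_RE`, and `group_by_category` are all assumptions made for the example. Writing each group to its own txt file is omitted.

```python
import re
from collections import defaultdict

MIN_CHARS = 30  # filtering threshold stated in Step 1

# Assumed record layout: <url>...</url> followed by <content>...</content>.
RECORD_RE = re.compile(r"<url>(.*?)</url>.*?<content>(.*?)</content>", re.S)
# Assumed convention: the first label of the host name is the category.
CATEGORY_RE = re.compile(r"https?://([a-z0-9]+)\.")


def group_by_category(raw_text, min_chars=MIN_CHARS):
    """Return {category: [content, ...]}, skipping content shorter than min_chars."""
    grouped = defaultdict(list)
    for url, content in RECORD_RE.findall(raw_text):
        content = content.strip()
        if len(content) < min_chars:
            continue  # content below the threshold is filtered out
        match = CATEGORY_RE.match(url.strip())
        if match:
            grouped[match.group(1)].append(content)
    return dict(grouped)


# Two made-up records: one long enough to keep, one too short.
sample = (
    "<doc><url>http://gongyi.sohu.com/a.shtml</url>"
    "<content>" + "公益新闻内容" * 6 + "</content></doc>"
    "<doc><url>http://sports.sohu.com/b.shtml</url>"
    "<content>太短</content></doc>"
)
grouped = group_by_category(sample)
```

Each key of `grouped` would correspond to one txt file in the scheme described above.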

Step 2: word segmentation. The corpus handled here is Chinese, so Chinese word segmentation is required, and the jieba segmenter is used. Stop words are then removed. Many words occur frequently in Chinese articles but carry little practical meaning for classifying them, for example "only" and "should"; computing over them wastes space and time and may also affect the final classification result.
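A minimal sketch of the stop-word removal follows. It works on an already segmented token list so that it runs without third-party packages. The stop-word set and the name `clean_tokens` are hypothetical, and the removal of single-character words follows the summary of the invention rather than this step.

```python
# Hypothetical stop-word list; a real one would normally be loaded from a file.
STOP_WORDS = {"的", "只", "应该", "了", "在"}


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words, whitespace-only tokens, and single-character words."""
    return [
        tok for tok in tokens
        if tok.strip() and tok not in stop_words and len(tok) > 1
    ]


# Tokens as a segmenter might return them for one sentence; with jieba
# installed they could come from jieba.lcut(sentence) instead.
tokens = ["南京", "的", "大学", "应该", "在", "明天", "开学"]
cleaned = clean_tokens(tokens)
```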

Step 3: label sentences using a neural-network-based named entity recognition method, then perform tasks such as date recognition, person name recognition, and place name recognition to extract the important named entities in the text.

Named entity recognition is a basic task of natural language processing and an essential component of many natural language processing technologies, such as information extraction, information retrieval, machine translation, and question answering systems. In most cases a dictionary cannot fully cover uncommon place names, person names, and similar information, and the naming rules for these entities vary widely. Recognising these words is nevertheless a particularly basic and important part of natural language processing, so their recognition needs to be handled as a separate task, usually completed during early lexical processing; this step is called named entity recognition (NER). Chinese word usage is flexible and variable, which makes Chinese NER more challenging. Named entity recognition has achieved considerable results in research on entities such as person names and place names in news corpora.

The present invention uses a neural-network-based NER method in which a BiLSTM-CRF model performs sequence labelling using character embeddings and word embeddings. Starting from the input layer, the layers of the model are, in order, the look-up layer (which maps each character x_i in the sentence from a one-hot vector to a low-dimensional dense character vector), the bidirectional LSTM layer (which automatically extracts sentence features), the CRF layer (which performs sentence-level sequence labelling), and the output layer. The parts tagged "ns" in a part-of-speech-tagged data set are used here to build the place name recognition corpus. For example, from "[Hong Kong/ns Special/a Administrative Region/n]ns" the place name "Hong Kong Special Administrative Region" can be extracted directly (the "ns" inside the brackets is no longer treated as a separate place name). Named entities are recognised here with the BiLSTM-CRF model. Taking organization names as an example, the defined rules are shown in the following table:

Table 1: Named-entity tags for organization names
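The construction of the place name corpus from "ns" tags can be illustrated with a short sketch. It covers only the corpus-building step, not the BiLSTM-CRF model. The Chinese sample sentence is an assumed original-language form of the Hong Kong example, and the names `BRACKET_NS`, `SINGLE_NS`, and `extract_place_names` are hypothetical.

```python
import re

# A bracketed group tagged "ns" is one compound place name.
BRACKET_NS = re.compile(r"\[([^\]]+)\]\s*ns")
# A single token tagged "ns" outside any bracket is a simple place name.
SINGLE_NS = re.compile(r"(\S+?)/ns")


def extract_place_names(tagged):
    """Collect place names from a part-of-speech-tagged sentence."""
    places = []

    def merge(match):
        # Strip the per-word tags and join the words inside the brackets.
        words = [item.split("/")[0] for item in match.group(1).split()]
        places.append("".join(words))
        return " "  # remove the group so inner "ns" tags are not counted again

    remainder = BRACKET_NS.sub(merge, tagged)
    places.extend(SINGLE_NS.findall(remainder))
    return places


tagged = "[香港/ns 特别/a 行政区/n]ns 位于/v 中国/ns 南部/f"
places = extract_place_names(tagged)
```

Compound names are collected before simple ones, so the returned order does not necessarily follow the order of appearance in the sentence.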

Step 4: build the word complex network. The words obtained in the previous step are digitally encoded and the encoded results are used as nodes. A distance of 2 is used here as the distance for word association, i.e. words within a distance of 2 of each other are connected by an edge. This judgment is made in a loop over every sentence.
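The network construction in Step 4 can be sketched with plain dictionaries. The sketch assumes an undirected, unweighted network and integer codes assigned in order of first appearance; neither detail is stated in the text, and `build_network` is a hypothetical name.

```python
from itertools import count

WINDOW = 2  # words at most 2 positions apart are linked


def build_network(sentences, window=WINDOW):
    """Return (word_to_id, adjacency) for a list of tokenised sentences."""
    word_to_id = {}
    next_id = count()
    adjacency = {}
    for sentence in sentences:
        ids = []
        for word in sentence:
            if word not in word_to_id:
                # Digital encoding: each new word gets the next integer code.
                word_to_id[word] = next(next_id)
                adjacency[word_to_id[word]] = set()
            ids.append(word_to_id[word])
        for pos, node in enumerate(ids):
            # Link the node to the next `window` words of the same sentence.
            for other in ids[pos + 1: pos + 1 + window]:
                if other != node:  # no self-loops
                    adjacency[node].add(other)
                    adjacency[other].add(node)
    return word_to_id, adjacency


sentences = [["南京", "大学", "明天", "开学"], ["大学", "新闻"]]
word_to_id, adjacency = build_network(sentences)
```

Edges never cross sentence boundaries here, which matches the per-sentence loop described above.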

Step 5: calculate the degree centrality and closeness centrality of each node, normalize them, and add them with weights to obtain the global metric value of the node. The calculation is repeated for every node.

At present, it is generally accepted in academia that degree centrality is an important local metric in complex networks and that closeness centrality is an important global metric. Here the two indices, which emphasise different aspects, are combined to propose a keyword extraction algorithm based on complex network features.

Let V = {v_1, v_2, …, v_N} be a set of N nodes and let (v_i, v_j) denote the edge between nodes v_i ∈ V and v_j ∈ V. Let G(V, E) be the graph with V as its node set and E ⊆ V × V as its edge set. The degree centrality DC_i of node v_i is:

DC_i = k_i / (N − 1)    (1)

In the formula, k_i is the degree of the node, i.e. the number of nodes connected to it, and N is the number of nodes in the network. Degree centrality is a local feature of the node.

Beyond its neighbours, to capture the relationship between a node and the remaining distant nodes in the network, and thus extract the global property of the node, the average distance from a node v_i to all nodes in the network, denoted d_i, can be calculated:

d_i = (1/N) · Σ_{j=1}^{N} d_ij    (2)

In the formula, d_ij is the distance from node v_i to node v_j. The average distance between all nodes in the network is then calculated with the following formula:

L = (1/N) · Σ_{i=1}^{N} d_i    (3)

d_i reflects the relative importance of node v_i in the network: the smaller d_i is, the closer node v_i is to the other nodes. The reciprocal of d_i is defined here as the closeness centrality of node v_i, denoted CC_i:

CC_i = 1 / d_i = N / Σ_{j=1}^{N} d_ij    (4)

The degree centrality of a node reflects its aggregation within a local range and embodies the local characteristics of the node in the network. In general, the larger the degree centrality of a node, the more neighbours it has, and the more important the node is within its local range. However, if degree centrality alone is used as the evaluation index, only words with a large degree within a local range can be extracted, and words that strongly affect the whole network but have low degree centrality are ignored. Closeness centrality is a global network feature: a node with high closeness centrality is close to every other node and lies spatially at the centre of the network. Combining these two indices, which emphasise different aspects, a new node importance index, the global metric value, is proposed here; the global metric value of node v_i is denoted CM_i:

CM_i = α·DC_i + β·CC_i    (5)

where DC_i is the degree centrality of node v_i and CC_i is the closeness centrality of node v_i. α and β are adjustable parameters with α + β = 1 (α = 0.4 and β = 0.6 in the present invention).
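The global metric value of formula (5) can be computed as in the sketch below. It assumes a connected, undirected, unweighted network with at least two nodes, takes DC_i = k_i / (N − 1), and takes CC_i as N divided by the sum of shortest-path distances from v_i; the additional normalization mentioned in Step 5 is not specified in the text and is left out. The function names and the toy network are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def global_metric(adjacency, alpha=0.4, beta=0.6):
    """CM_i = alpha * DC_i + beta * CC_i for every node of a connected graph."""
    n = len(adjacency)
    scores = {}
    for node, neighbours in adjacency.items():
        dc = len(neighbours) / (n - 1)  # degree centrality
        total = sum(bfs_distances(adjacency, node).values())
        cc = n / total  # reciprocal of the mean distance d_i
        scores[node] = alpha * dc + beta * cc
    return scores


# Toy network: a three-word chain with "大学" in the middle.
adjacency = {
    "新闻": {"大学"},
    "大学": {"新闻", "南京"},
    "南京": {"大学"},
}
scores = global_metric(adjacency)
keywords = sorted(scores, key=scores.get, reverse=True)[:1]
```

In this chain the middle word has both the highest degree and the smallest mean distance, so it ranks first under any choice of α and β.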

The 4000 web news samples used here all come from the Sogou corpus. python3.6 is used as the experimental tool, and all text is encoded in UTF-8. Text classification requires samples and labels. The sample data could be the news headline plus the news content; for simplicity, only the news content is used here. The category label of a news item can be taken from its URL: for example, the URL gongyi.sohu.com shows that the news item belongs to the category "public welfare". The ratio of training samples to test samples in this experiment is 4:1, i.e. 3200 documents are used as training samples and 800 documents as test documents.
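The 4:1 split can be sketched as follows. The shuffling, the fixed seed, and the placeholder documents are assumptions made for the example; the text does not say how the 3200 training documents were selected.

```python
import random


def split_train_test(samples, train_parts=4, test_parts=1, seed=0):
    """Shuffle samples reproducibly and split them train_parts:test_parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Placeholder (document, label) pairs standing in for the 4000 news items.
documents = [(f"doc{i}", "gongyi") for i in range(4000)]
train, test = split_train_test(documents)
```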

Table 2: Data distribution

Table 3: Experimental environment

Evaluation results:

In text classification tasks, precision P, recall R, and the F1 score are commonly used as evaluation indices. They are also used as the evaluation indices in this experiment.

The parameters in the above formulas are defined in the following table.

Table 4: Parameter definitions
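For reference, the three indices can be computed per category as in the sketch below. It uses the standard definitions in terms of true positives (tp), false positives (fp), and false negatives (fn); these variable names and the sample labels are assumptions and may differ from the symbols used in Table 4.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall, and F1 for one category treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels for five documents.
true_labels = ["gongyi", "gongyi", "sports", "sports", "gongyi"]
predicted = ["gongyi", "sports", "sports", "gongyi", "gongyi"]
p, r, f1 = precision_recall_f1(true_labels, predicted, positive="gongyi")
```

Here two "gongyi" documents are classified correctly, one is missed, and one "sports" document is mislabelled as "gongyi", so all three indices equal 2/3.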

The present invention studies the network characteristics of words in Chinese news texts, combines named entity recognition (NER) with the complex network (CN) characteristics of natural language, and proposes a new keyword extraction algorithm based on NER combined with complex network features (NER-CN). The algorithm first labels sentences and performs named entity recognition analysis, then constructs a complex network of the text and extracts keywords according to the global metric value of each node in the network. Compared with conventional methods, the proposed keyword extraction algorithm significantly improves the precision, recall, and F1 score of text classification.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A news keyword extraction method based on NER and complex network features, characterized by comprising the following steps:

Step 1: collect news content and generate a file list of the news content under the raw corpus folder; set a filtering threshold and filter out content whose character count is below the threshold; for each corpus item, match the url and the content with regular expressions; all news content is stored in separate txt files according to the category in the url;

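Step 2: segment the content of each txt file with the jieba word segmenter and remove stop words;

Step 3: label sentences using a neural-network-based named entity recognition method, then perform date recognition, person name recognition, and place name recognition to extract the important named entities in the text;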

Step 4: build the word complex network; digitally encode the words obtained in step 3 and use the encoded results as nodes; use 2 as the distance for word association, i.e. words within a distance of 2 of each other are connected by an edge; this judgment is made in a loop over every sentence;

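Let V = {v_1, v_2, …, v_N} be a set of N nodes and let (v_i, v_j) denote the edge between nodes v_i ∈ V and v_j ∈ V; G(V, E) is the graph with V as its node set and E ⊆ V × V as its edge set. The degree centrality DC_i of node v_i is:

DC_i = k_i / (N − 1)    (1)

where k_i is the degree of node v_i, i.e. the number of nodes connected to it, and N is the number of nodes in the network;

The average distance from a node v_i to all nodes in the network, denoted d_i, is calculated as:

d_i = (1/N) · Σ_{j=1}^{N} d_ij    (2)

where d_ij is the distance from node v_i to node v_j. The average distance L between all nodes in the network is calculated with the following formula:

L = (1/N) · Σ_{i=1}^{N} d_i    (3)

The reciprocal of d_i is defined as the closeness centrality of node v_i, denoted CC_i:

CC_i = 1 / d_i = N / Σ_{j=1}^{N} d_ij    (4)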

Closeness centrality and degree centrality are combined to obtain a new node importance index, the global metric value; the global metric value of node v_i is denoted CM_i:

CM_i = α·DC_i + β·CC_i    (5)

where DC_i is the degree centrality of node v_i, CC_i is the closeness centrality of node v_i, α is the adjustable parameter for degree centrality, β is the adjustable parameter for closeness centrality, and α + β = 1. Keywords are extracted according to the resulting global metric values.

2. The news keyword extraction method based on NER and complex network features according to claim 1, characterized in that the filtering threshold in step 1 is 30.

3. The news keyword extraction method based on NER and complex network features according to claim 2, characterized in that the neural-network-based named entity recognition method in step 3 is as follows: a BiLSTM-CRF model performs sequence labelling using character embeddings and word embeddings; starting from the input layer, the layers of the model are, in order, the look-up layer, the bidirectional LSTM layer, the CRF layer, and the output layer; the look-up layer maps each character x_i in the sentence from a one-hot vector to a low-dimensional dense character vector, the bidirectional LSTM layer automatically extracts sentence features, and the CRF layer performs sentence-level sequence labelling; the parts tagged "ns" in a part-of-speech-tagged data set are used to build the place name recognition corpus, and named entities are recognised with the BiLSTM-CRF model.

4. The news keyword extraction method based on NER and complex network features according to claim 3, characterized in that the adjustable parameter for degree centrality is α = 0.4 and the adjustable parameter for closeness centrality is β = 0.6.