CN109885669A - Method and system for acquiring text keywords based on a complex network


Info

Publication number
CN109885669A
Authority
CN
China
Prior art keywords
text
keyword
network structure
core
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910090349.1A
Other languages
Chinese (zh)
Inventor
郑坤
李旦
姚宏
刘超
董理君
康晓军
李新川
李圣文
梁庆中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences
Priority to CN201910090349.1A
Publication of CN109885669A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a method and system for acquiring text keywords based on a complex network. The method comprises: first, preprocessing the text from which keywords are to be extracted with an NLP tool; then, converting the preprocessed text into a network-structured representation using a text-to-graph method; and finally, decomposing that network with the k-core decomposition method from complex network theory to obtain the most central words of the network-structured text, which are the required keywords, thereby obtaining all keywords of the text. The beneficial effect of the invention is that the proposed technical solution starts from the structure of the text and uses the text's network structure to improve the accuracy of keyword acquisition.

Description

Method and system for acquiring text keywords based on a complex network
Technical field
The present invention relates to the field of text keyword extraction, and in particular to a method and system for acquiring text keywords based on a complex network.
Background
With the continuous development of computer networks, a large amount of information is generated and spread on the Internet every day, much of it text. People's energy is limited, however, and no one can be interested in every text or have time to study each one carefully. A way to quickly grasp the content of a text and judge whether it is of interest would save time while letting readers reach as much interesting content as possible. This gives rise to the demand for automatic keyword extraction: keywords give a rough picture of a text's main content, from which one can judge whether the text is worth reading closely.
Keyword extraction also helps text classification: the extracted keywords can assist in judging whether two texts belong to the same category.
Keyword extraction is widely used in many fields, and various keyword extraction methods have been proposed, such as the word-frequency-based tf-idf method and semantics-based methods. Current methods nevertheless have many problems. The main and most critical one is low extraction accuracy. In addition, some keyword extraction methods require manually constructed rules, which consumes substantial human resources and generalizes poorly.
Summary of the invention
Traditional keyword acquisition methods (such as tf-idf) consider only word frequency information and ignore the structural information of the text, which makes keyword acquisition inaccurate. To solve this problem, the present invention provides a method and system for acquiring text keywords based on a complex network. The method mainly comprises the following steps:
S101: preprocessing the text from which keywords are to be extracted, so as to remove the punctuation and stop words in the text and obtain the preprocessed text;
S102: converting the preprocessed text into a network-structured text using a text-to-graph method;
S103: decomposing the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted.
Further, in step S101, preprocessing the text from which keywords are to be extracted comprises:
S201: performing stem restoration on the text from which keywords are to be extracted using an NLP tool to obtain the restored text;
S202: removing stop words and removing punctuation from the restored text to obtain the preprocessed text.
Further, in step S201, the NLP tool is the Stanford CoreNLP tool.
Further, in step S102, the text-to-graph method is the sequential graph construction method.
Further, in step S103, decomposing the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text comprises:
S301: removing all nodes of degree 1 from the network-structured text and updating the network; removing again the nodes whose degree has newly become 1 in the updated network; and repeating until no node of degree 1 remains in the network-structured text;
S302: assigning all nodes removed in step S301 to the 1-core group;
S303: obtaining the 2-core, 3-core, …, n-core of the network-structured text in turn by the same method as steps S301 and S302, where n is greater than 0 and is the maximum node degree in the network-structured text;
S304: taking the words corresponding to all nodes in the (n-x)-core through the n-core as the final keywords of the text from which keywords are to be extracted, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirement.
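Read in standard graph-theoretic terms, steps S301 to S304 can be summarized as below. This is an interpretive sketch rather than wording from the application: the symbols G, H_k, c(v), S_k and K_x are introduced here for illustration only, and n is taken as the largest k whose group is non-empty, which is an assumption (the text defines n as the maximum node degree).

```latex
\begin{aligned}
H_k  &= \text{the largest subgraph of } G=(V,E) \text{ in which every node has degree} \ge k,\\
c(v) &= \max\{\,k : v \in H_k\,\} \quad \text{(the core number of node } v\text{)},\\
S_k  &= \{\,v \in V : c(v) = k\,\} \quad \text{(the nodes removed at stage } k\text{, i.e. the ``}k\text{-core group'')},\\
K_x  &= \bigcup_{k=n-x}^{n} S_k \;=\; \{\,v \in V : c(v) \ge n-x\,\},
        \qquad n = \max_{v \in V} c(v),\quad x \ge 0 .
\end{aligned}
```

Under this reading, K_0 is the innermost group of the network and larger x admits progressively more peripheral words. Isolated nodes, for which c(v) = 0, would be removed together with stage 1.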
Further, a system for acquiring text keywords based on a complex network is characterized by comprising the following modules:
a preprocessing module, configured to preprocess the text from which keywords are to be extracted, so as to remove the punctuation and stop words in the text and obtain the preprocessed text;
a text-to-graph module, configured to convert the preprocessed text into a network-structured text using a text-to-graph method;
a keyword extraction module, configured to decompose the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted.
Further, the preprocessing module, which preprocesses the text from which keywords are to be extracted, comprises the following units:
a stem restoration unit, configured to perform stem restoration on the text from which keywords are to be extracted using an NLP tool to obtain the restored text;
a processing unit, configured to remove stop words and remove punctuation from the restored text to obtain the preprocessed text.
Further, in the stem restoration unit, the NLP tool is the Stanford CoreNLP tool.
Further, in the text-to-graph module, the text-to-graph method is the sequential graph construction method.
Further, the keyword extraction module, which decomposes the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text, comprises the following units:
a node removal unit, configured to remove all nodes of degree 1 from the network-structured text and update the network, to remove again the nodes whose degree has newly become 1 in the updated network, and to repeat until no node of degree 1 remains in the network-structured text;
a node grouping unit, configured to assign all nodes removed in step S301 to the 1-core group;
a loop processing unit, configured to obtain the 2-core, 3-core, …, n-core of the network-structured text in turn by the same method as steps S301 and S302, where n is greater than 0 and is the maximum node degree in the network-structured text;
a keyword extraction unit, configured to take the words corresponding to all nodes in the (n-x)-core through the n-core as the final keywords of the text from which keywords are to be extracted, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirement.
The technical solution provided by the invention has the following beneficial effect: it starts from the structure of the text and uses the text's network structure to improve the accuracy of keyword acquisition.
Brief description of the drawings
The invention is further described below with reference to the drawings and embodiments, in which:
Fig. 1 is a flowchart of the method for acquiring text keywords based on a complex network in an embodiment of the invention;
Fig. 2 is a schematic diagram of the modules of the system for acquiring text keywords based on a complex network in an embodiment of the invention;
Fig. 3 is an example of a network-structured text in an embodiment of the invention;
Fig. 4 is an example of the nodes of a text after grouping in an embodiment of the invention.
Detailed description of the embodiments
For a clearer understanding of the technical features, objects and effects of the invention, specific embodiments are now described in detail with reference to the drawings.
The embodiments of the invention provide a method and system for acquiring text keywords based on a complex network.
Referring to Fig. 1, which is a flowchart of the method for acquiring text keywords based on a complex network in an embodiment of the invention, the method comprises the following steps:
S101: preprocessing the text from which keywords are to be extracted, so as to remove the punctuation and stop words in the text and obtain the preprocessed text;
S102: converting the preprocessed text into a network-structured text using a text-to-graph method;
S103: decomposing the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted.
In step S101, preprocessing the text from which keywords are to be extracted comprises:
S201: performing stem restoration on the text from which keywords are to be extracted using an NLP tool to obtain the restored text; for example, "was" is restored to "is" and "does" is restored to "do";
S202: removing stop words (such as "is", "a" and "an" in English, and the corresponding function words in Chinese) and removing punctuation from the restored text to obtain the preprocessed text.
In step S201, the NLP tool is the Stanford CoreNLP tool.
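Steps S201 and S202 can be illustrated with a minimal Python sketch. It does not call Stanford CoreNLP: the `LEMMAS` map and the `STOP_WORDS` set below are hypothetical stand-ins that cover only the examples mentioned above, and the sentence splitting is a simplifying assumption.

```python
import re

# Hypothetical, deliberately tiny resources. A real system would use the
# lemmatizer and stop-word list of an NLP toolkit instead.
LEMMAS = {"was": "is", "does": "do"}            # the two examples from the text
STOP_WORDS = {"is", "a", "an", "the", "of", "do"}

def preprocess(text):
    """Return one list of remaining word tokens per sentence."""
    result = []
    # Assumption: sentences end at '.', '!', '?' or ';'.
    for sentence in re.split(r"[.!?;]+", text):
        # Keeping only alphabetic runs also strips the punctuation.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        restored = [LEMMAS.get(tok, tok) for tok in tokens]         # S201
        kept = [tok for tok in restored if tok not in STOP_WORDS]   # S202
        if kept:
            result.append(kept)
    return result

print(preprocess("Wuhan was a city. He does like the traffic!"))
```

Returning one token list per sentence keeps the word order available for the later graph construction step.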
In step S102, the text-to-graph method is the sequential graph construction method: each word is a node in the network, and each word is connected by an edge to the word that follows it, in the order in which the words appear in the text.
Example: the text from which keywords are to be extracted contains two sentences:
Sentence 1: I like the night scene of Wuhan;
Sentence 2: He does not like the traffic in Wuhan;
Using the text-to-graph method, "Wuhan" and "night scene" are connected by an edge, and "Wuhan" and "traffic" are connected by another edge. The words "I", "he", "like", "do not like", "Wuhan", "night scene" and "traffic" are nodes; the node "Wuhan" has degree 2, and every other node has degree 1.
In step S103, decomposing the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted comprises:
S301: removing all nodes of degree 1 from the network-structured text and updating the network; removing again the nodes whose degree has newly become 1 in the updated network; and repeating until no node of degree 1 remains in the network-structured text;
S302: assigning all nodes removed in step S301 to the 1-core group;
S303: obtaining the 2-core, 3-core, …, n-core of the network-structured text in turn by the same method as steps S301 and S302, where n is greater than 0 and is the maximum node degree in the network-structured text;
S304: taking the words corresponding to all nodes in the (n-x)-core through the n-core as the final keywords of the text from which keywords are to be extracted, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirement.
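Steps S301 to S304 can be sketched in plain Python as a peeling loop. This is one possible reading, not the applicant's implementation: `core_groups` and `select_keywords` are hypothetical names, nodes whose degree has dropped to k or below (including isolated nodes) are removed at stage k, and n is taken as the highest stage that produced a non-empty group, whereas the text defines n as the maximum degree. In graph-theory literature the group removed at stage k is usually called the k-shell.

```python
def core_groups(graph):
    """Peel the graph stage by stage; return {k: nodes removed at stage k}."""
    # Work on a copy so the caller's adjacency sets are left untouched.
    remaining = {node: set(nbrs) for node, nbrs in graph.items()}
    groups = {}
    k = 1
    while remaining:
        removed = set()
        while True:
            # Nodes whose current degree is k or lower (S301).
            low = [n for n, nbrs in remaining.items() if len(nbrs) <= k]
            if not low:
                break
            for node in low:
                for nbr in remaining.pop(node):
                    if nbr in remaining:
                        remaining[nbr].discard(node)
                removed.add(node)
        if removed:
            groups[k] = removed          # S302 for k = 1, S303 for k > 1
        k += 1
    return groups

def select_keywords(groups, x=0):
    """Union of the groups from stage (n - x) up to n, the highest stage (S304)."""
    if not groups:
        return set()
    n = max(groups)
    return set().union(*(nodes for k, nodes in groups.items() if k >= n - x))

# Toy graph: a 4-clique a-b-c-d, a word e attached to a and b, and f attached to e.
toy = {
    "a": {"b", "c", "d", "e"}, "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d"},      "d": {"a", "b", "c"},
    "e": {"a", "b", "f"},      "f": {"e"},
}
groups = core_groups(toy)
print(groups)
print(select_keywords(groups, x=0))
```

On the toy graph, f is peeled at stage 1, e at stage 2, and the clique at stage 3, so with x = 0 the clique words are returned and with x = 1 the word e is added.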
Referring to Fig. 2, which is a schematic diagram of the modules of the system for acquiring text keywords based on a complex network in an embodiment of the invention, the system comprises, connected in sequence: a preprocessing module 11, a text-to-graph module 12 and a keyword extraction module 13;
the preprocessing module 11 is configured to preprocess the text from which keywords are to be extracted, so as to remove the punctuation and stop words in the text and obtain the preprocessed text;
the text-to-graph module 12 is configured to convert the preprocessed text into a network-structured text using a text-to-graph method;
the keyword extraction module 13 is configured to decompose the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted.
The preprocessing module 11, which preprocesses the text from which keywords are to be extracted, comprises the following units:
a stem restoration unit, configured to perform stem restoration on the text from which keywords are to be extracted using an NLP tool to obtain the restored text;
a processing unit, configured to remove stop words and remove punctuation from the restored text to obtain the preprocessed text.
In the stem restoration unit, the NLP tool is the Stanford CoreNLP tool.
In the text-to-graph module 12, the text-to-graph method is the sequential graph construction method.
The keyword extraction module 13, which decomposes the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text, comprises the following units:
a node removal unit, configured to remove all nodes of degree 1 from the network-structured text and update the network, to remove again the nodes whose degree has newly become 1 in the updated network, and to repeat until no node of degree 1 remains in the network-structured text;
a node grouping unit, configured to assign all nodes removed in step S301 to the 1-core group;
a loop processing unit, configured to obtain the 2-core, 3-core, …, n-core of the network-structured text in turn by the same method as steps S301 and S302, where n is greater than 0 and is the maximum node degree in the network-structured text;
a keyword extraction unit, configured to take the words corresponding to all nodes in the (n-x)-core through the n-core as the final keywords of the text from which keywords are to be extracted, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirement.
In this embodiment of the invention, text from the Oxford dictionary of English is selected as the text from which keywords are to be extracted, as follows:
First, the Stanford CoreNLP tool is used to tokenize the text and perform stem restoration; stem restoration means, for example, restoring "was" to "is" and "does" to "do". Stop words and punctuation are then removed from the restored text. Table 1 shows an example of text preprocessing:
Table 1. Example of text preprocessing results
Then, the preprocessed text is converted into a network-structured text using the sequential graph construction method. Fig. 3 shows an example of a text converted into its network-structured representation.
Finally, the k-core decomposition method from complex network theory is applied to the network-structured text to obtain the most central part of the text, that is, the required keywords. The specific steps are as follows:
(1) All nodes of degree 1 in the graph are removed and the graph is updated. Some nodes will now have a new degree of 1; these nodes are removed as well, until no node of degree 1 remains in the network. All nodes removed in this process are assigned to the 1-core group;
(2) The 2-core, 3-core and so on are extracted from the graph in turn by the same method as in (1), until no node remains in the network-structured text. As shown in Fig. 4, the part with the hatched background is the 1-core, the part with the diamond background is the 2-core, and the part with the dotted background is the 3-core. The words corresponding to all nodes in the 3-core are the most central part of the network-structured text, that is, the keywords of the text from which keywords are to be extracted.
The beneficial effect of the invention is that the proposed technical solution starts from the structure of the text and uses the text's network structure to improve the accuracy of keyword acquisition.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A method for acquiring text keywords based on a complex network, characterized by comprising the following steps:
S101: preprocessing the text from which keywords are to be extracted, so as to remove the punctuation and stop words in the text and obtain the preprocessed text;
S102: converting the preprocessed text into a network-structured text using a text-to-graph method;
S103: decomposing the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted.
2. The method for acquiring text keywords based on a complex network according to claim 1, characterized in that in step S101, preprocessing the text from which keywords are to be extracted comprises:
S201: performing stem restoration on the text from which keywords are to be extracted using an NLP tool to obtain the restored text;
S202: removing stop words and removing punctuation from the restored text to obtain the preprocessed text.
3. The method for acquiring text keywords based on a complex network according to claim 2, characterized in that in step S201, the NLP tool is the Stanford CoreNLP tool.
4. The method for acquiring text keywords based on a complex network according to claim 1, characterized in that in step S102, the text-to-graph method is the sequential graph construction method.
5. The method for acquiring text keywords based on a complex network according to claim 1, characterized in that in step S103, decomposing the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted comprises:
S301: removing all nodes of degree 1 from the network-structured text and updating the network; removing again the nodes whose degree has newly become 1 in the updated network; and repeating until no node of degree 1 remains in the network-structured text;
S302: assigning all nodes removed in step S301 to the 1-core group;
S303: obtaining the 2-core, 3-core, …, n-core of the network-structured text in turn by the same method as steps S301 and S302, where n is greater than 0 and is the maximum node degree in the network-structured text;
S304: taking the words corresponding to all nodes in the (n-x)-core through the n-core as the final keywords of the text from which keywords are to be extracted, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirement.
6. A system for acquiring text keywords based on a complex network, characterized by comprising the following modules:
a preprocessing module, configured to preprocess the text from which keywords are to be extracted, so as to remove the punctuation and stop words in the text and obtain the preprocessed text;
a text-to-graph module, configured to convert the preprocessed text into a network-structured text using a text-to-graph method;
a keyword extraction module, configured to decompose the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted.
7. The system for acquiring text keywords based on a complex network according to claim 6, characterized in that the preprocessing module, which preprocesses the text from which keywords are to be extracted, comprises the following units:
a stem restoration unit, configured to perform stem restoration on the text from which keywords are to be extracted using an NLP tool to obtain the restored text;
a processing unit, configured to remove stop words and remove punctuation from the restored text to obtain the preprocessed text.
8. The system for acquiring text keywords based on a complex network according to claim 7, characterized in that in the stem restoration unit, the NLP tool is the Stanford CoreNLP tool.
9. The system for acquiring text keywords based on a complex network according to claim 6, characterized in that in the text-to-graph module, the text-to-graph method is the sequential graph construction method.
10. The system for acquiring text keywords based on a complex network according to claim 6, characterized in that the keyword extraction module, which decomposes the network-structured text using the k-core decomposition method from complex network theory to obtain the keywords of the text from which keywords are to be extracted, comprises the following units:
a node removal unit, configured to remove all nodes of degree 1 from the network-structured text and update the network, to remove again the nodes whose degree has newly become 1 in the updated network, and to repeat until no node of degree 1 remains in the network-structured text;
a node grouping unit, configured to assign all nodes removed in step S301 to the 1-core group;
a loop processing unit, configured to obtain the 2-core, 3-core, …, n-core of the network-structured text in turn by the same method as steps S301 and S302, where n is greater than 0 and is the maximum node degree in the network-structured text;
a keyword extraction unit, configured to take the words corresponding to all nodes in the (n-x)-core through the n-core as the final keywords of the text from which keywords are to be extracted, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirement.
CN201910090349.1A 2019-01-30 2019-01-30 A kind of text key word acquisition methods and system based on complex network Pending CN109885669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910090349.1A CN109885669A (en) 2019-01-30 2019-01-30 A kind of text key word acquisition methods and system based on complex network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910090349.1A CN109885669A (en) 2019-01-30 2019-01-30 A kind of text key word acquisition methods and system based on complex network

Publications (1)

Publication Number Publication Date
CN109885669A true CN109885669A (en) 2019-06-14

Family

ID=66927352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910090349.1A Pending CN109885669A (en) 2019-01-30 2019-01-30 A kind of text key word acquisition methods and system based on complex network

Country Status (1)

Country Link
CN (1) CN109885669A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460796A (en) * 2020-03-30 2020-07-28 北京航空航天大学 Accidental sensitive word discovery method based on word network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933032A (en) * 2015-06-29 2015-09-23 电子科技大学 Method for extracting keywords of blog based on complex network
CN106844500A (en) * 2016-12-26 2017-06-13 深圳大学 A kind of k core truss community models and decomposition, searching algorithm
CN107784087A (en) * 2017-10-09 2018-03-09 东软集团股份有限公司 A kind of hot word determines method, apparatus and equipment
CN108763687A (en) * 2018-05-17 2018-11-06 重庆大学 The analysis method of public traffic network topological attribute and space attribute

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于群: "基于语义的中文文本特征提取方法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Similar Documents

Publication Publication Date Title
CN103123618B (en) Text similarity acquisition methods and device
CN105930509B (en) Field concept based on statistics and template matching extracts refined method and system automatically
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN103778243B (en) Domain term extraction method
CN106484767A (en) A kind of event extraction method across media
CN107766371A (en) A kind of text message sorting technique and its device
CN107273474A (en) Autoabstract abstracting method and system based on latent semantic analysis
CN101013443A (en) Intelligent word input method and input method system and updating method thereof
CN103106189B (en) A kind of method and apparatus excavating synonym attribute word
CN103207856A (en) Ontology concept and hierarchical relation generation method
CN106055623A (en) Cross-language recommendation method and system
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN106919557A (en) A kind of document vector generation method of combination topic model
CN108563667A (en) Hot issue acquisition system based on new word identification and its method
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN110188359B (en) Text entity extraction method
CN110674298B (en) Deep learning mixed topic model construction method
CN109308317A (en) A kind of hot spot word extracting method of the non-structured text based on cluster
CN106610952A (en) Mixed text feature word extraction method
CN108712466A (en) A kind of semanteme sparse Web service discovery method being embedded in based on Gaussian ATM and word
CN102270244B (en) Method for quickly extracting webpage content key words based on core sentence
CN106570120A (en) Process for realizing searching engine optimization through improved keyword optimization
Devika et al. A semantic graph-based keyword extraction model using ranking method on big social data
Wang et al. Constructing service network via classification and annotation
CN109885669A (en) A kind of text key word acquisition methods and system based on complex network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190614