CN109885669A - Method and system for extracting text keywords based on complex networks - Google Patents
- Publication number
- CN109885669A (application number CN201910090349.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- keyword
- network structure
- core
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The present invention provides a method and a system for extracting text keywords based on complex networks. The method comprises: first, preprocessing the text from which keywords are to be extracted with an NLP tool; then, converting the preprocessed text into a network representation using a text-to-graph method; and finally, decomposing the network representation of the text with the k-core decomposition method from complex network theory to obtain the most central words of the network, which are the required keywords, thereby obtaining all keywords of the text. Beneficial effect: by starting from the structure of the text and using its network representation, the proposed technical solution improves the accuracy of text keyword extraction.
Description
Technical field
The present invention relates to the field of text keyword extraction, and in particular to a method and system for extracting text keywords based on complex networks.
Background
With the continuous development of computer networks, a large amount of information is generated and spread on the Internet every day, much of it text. People's attention is limited, however, and nobody can be interested in every text or have the time to read each one closely. A way to grasp the content of a text quickly and decide whether it is of interest would save time while giving access to as much relevant content as possible. This is what creates the demand for automatic keyword extraction: keywords give a rough picture of a text's main content, from which a reader can judge whether the text deserves close reading.
Keyword extraction also helps text classification, since the extracted keywords can assist in judging whether two texts belong to the same category.
Keyword extraction is widely used in many fields, and various keyword extraction methods have been proposed, such as the word-frequency-based tf-idf method and semantics-based methods. Current methods nevertheless have many problems. The main and most critical one is low extraction accuracy. In addition, some methods require manually constructed rules, which consumes considerable human effort and generalizes poorly.
Summary of the invention
Traditional keyword extraction methods such as tf-idf consider only word-frequency information and ignore the structural information of the text, which makes the extracted keywords inaccurate. To solve this problem, the present invention provides a method and a system for extracting text keywords based on complex networks. The method mainly comprises the following steps:
S101: preprocess the text from which keywords are to be extracted, removing punctuation and stop words, to obtain the preprocessed text;
S102: convert the preprocessed text into a network representation using a text-to-graph method;
S103: decompose the network representation of the text using the k-core decomposition method from complex network theory to obtain the keywords of the text.
Further, in step S101, preprocessing the text from which keywords are to be extracted comprises:
S201: lemmatize the text with an NLP tool to obtain the lemmatized text;
S202: remove stop words and punctuation from the lemmatized text to obtain the preprocessed text.
Further, in step S201, the NLP tool is Stanford CoreNLP.
Further, in step S102, the text-to-graph method is the sequential graph construction method.
Further, in step S103, decomposing the network representation of the text with the k-core decomposition method from complex network theory to obtain the keywords of the text comprises:
S301: remove every node of degree 1 from the network and update the network; then remove the nodes whose degree has newly become 1 in the updated network; repeat until the network contains no node of degree 1;
S302: assign all the degree-1 nodes removed in step S301 to the 1-core group;
S303: apply the same procedure as in steps S301 and S302 to obtain, in turn, the 2-core, 3-core, ..., n-core groups of the network, where n is greater than 0 and is the maximum node degree in the network;
S304: take the words corresponding to all nodes in the (n-x)-core through n-core groups as the final keywords of the text, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirements.
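For reference, the standard graph-theoretic definitions behind these steps can be written as below. The symbols G, H_k, c(v) and S_k are notation introduced here for illustration and do not appear in the description; the groups that steps S302 and S303 call the 1-core, 2-core and so on correspond to what the complex-network literature usually calls k-shells.

```latex
\begin{align*}
H_k  &= \text{the maximal subgraph of } G \text{ such that } \deg_{H_k}(v) \ge k \text{ for every node } v \in H_k, \\
c(v) &= \max \{\, k : v \in H_k \,\}, \\
S_k  &= \{\, v : c(v) = k \,\} = V(H_k) \setminus V(H_{k+1}).
\end{align*}
```

Because c(v) never exceeds the degree of v in G, no group with an index above the maximum node degree can contain any node, which is consistent with n bounding the sequence of groups in step S303.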
Further, a system for extracting text keywords based on complex networks comprises the following modules:
a preprocessing module, configured to preprocess the text from which keywords are to be extracted, removing punctuation and stop words, to obtain the preprocessed text;
a text-to-graph module, configured to convert the preprocessed text into a network representation using a text-to-graph method;
a keyword extraction module, configured to decompose the network representation of the text using the k-core decomposition method from complex network theory to obtain the keywords of the text.
Further, in the preprocessing module, preprocessing the text from which keywords are to be extracted involves the following units:
a lemmatization unit, configured to lemmatize the text with an NLP tool to obtain the lemmatized text;
a processing unit, configured to remove stop words and punctuation from the lemmatized text to obtain the preprocessed text.
Further, in the lemmatization unit, the NLP tool is Stanford CoreNLP.
Further, in the text-to-graph module, the text-to-graph method is the sequential graph construction method.
Further, in the keyword extraction module, decomposing the network representation of the text with the k-core decomposition method from complex network theory to obtain the keywords of the text involves the following units:
a node removal unit, configured to remove every node of degree 1 from the network and update the network, and then to remove the nodes whose degree has newly become 1 in the updated network, repeating until the network contains no node of degree 1;
a node grouping unit, configured to assign all the degree-1 nodes removed in step S301 to the 1-core group;
a loop processing unit, configured to apply the same procedure as in steps S301 and S302 to obtain, in turn, the 2-core, 3-core, ..., n-core groups of the network, where n is greater than 0 and is the maximum node degree in the network;
a keyword extraction unit, configured to take the words corresponding to all nodes in the (n-x)-core through n-core groups as the final keywords of the text, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirements.
The beneficial effect of the technical solution provided by the present invention is that, by starting from the structure of the text and using its network representation, it improves the accuracy of text keyword extraction.
Brief description of the drawings
The present invention is further described below with reference to the drawings and embodiments, in which:
Fig. 1 is a flowchart of a method for extracting text keywords based on complex networks in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the module composition of a system for extracting text keywords based on complex networks in an embodiment of the present invention;
Fig. 3 is an example of the network representation of a text in an embodiment of the present invention;
Fig. 4 is an example of the nodes of a text after grouping in an embodiment of the present invention.
Detailed description of the embodiments
For a clearer understanding of the technical features, objects and effects of the present invention, specific embodiments are now described in detail with reference to the drawings.
The embodiments of the present invention provide a method and a system for extracting text keywords based on complex networks.
Referring to Fig. 1, which is a flowchart of a method for extracting text keywords based on complex networks in an embodiment of the present invention, the method comprises the following steps:
S101: preprocess the text from which keywords are to be extracted, removing punctuation and stop words, to obtain the preprocessed text;
S102: convert the preprocessed text into a network representation using a text-to-graph method;
S103: decompose the network representation of the text using the k-core decomposition method from complex network theory to obtain the keywords of the text.
In step S101, preprocessing the text from which keywords are to be extracted comprises:
S201: lemmatize the text with an NLP tool to obtain the lemmatized text; for example, "was" is restored to "is" and "does" to "do";
S202: remove stop words (such as "is", "a" and "an" in English, and words such as "some" in Chinese) and punctuation from the lemmatized text to obtain the preprocessed text.
In step S201, the NLP tool is Stanford CoreNLP.
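Steps S201 and S202 can be sketched roughly as follows. This is a minimal illustration that does not call CoreNLP: the lookup table `LEMMAS`, the set `STOP_WORDS` and the function name `preprocess` are hypothetical stand-ins for the lemmatizer and stop-word list that a real NLP toolkit would supply.

```python
import re
from typing import List

# Hypothetical, deliberately tiny resources; a real system would rely on the
# lemmatizer and stop-word list of an NLP toolkit instead of these tables.
LEMMAS = {"was": "is", "does": "do", "likes": "like"}
STOP_WORDS = {"is", "a", "an", "the", "in", "of"}


def preprocess(text: str) -> List[str]:
    """Return lemmatized tokens with punctuation and stop words removed."""
    # Lowercase and keep only runs of letters/apostrophes, which drops punctuation.
    tokens = re.findall(r"[a-z']+", text.lower())
    # S201: restore each token to its base form (toy lookup table).
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # S202: drop stop words.
    return [token for token in lemmas if token not in STOP_WORDS]


print(preprocess("He likes the night view, and she does a survey of the traffic."))
```

Lemmatizing before stop-word removal follows the order of S201 and S202, so an inflected form such as "was" is first mapped to "is" and only then filtered out.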
In step S102, the text-to-graph method is the sequential graph construction method: each word is a node in the network, and each word is connected by an edge to the word that follows it in the text.
Example: the text from which keywords are to be extracted contains two sentences:
Sentence 1: I like the night scene in Wuhan;
Sentence 2: he does not like the traffic in Wuhan;
With the text-to-graph method, "Wuhan" and "night scene" are joined by one edge, and "Wuhan" and "traffic" by another. The words "I", "he", "like", "not like", "Wuhan", "night scene" and "traffic" are the nodes; the node "Wuhan" has degree 2, and every other node has degree 1.
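The sequential construction can be sketched as below. The function name `build_sequence_graph` and the token order of the two sentences are assumptions: the tokens are arranged so that "Wuhan" directly precedes "night scene" and "traffic", which is what the edges named in the example imply. Because this sketch links every pair of consecutive words, it also creates the edges around "like" and "not like", so the degrees it reports are higher than the simplified count given in the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_sequence_graph(sentences: List[List[str]]) -> Dict[str, Set[str]]:
    """Link every word to the word that follows it in the same sentence."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for tokens in sentences:
        for word in tokens:
            graph[word]  # touch the key so a one-word sentence still yields a node
        for left, right in zip(tokens, tokens[1:]):
            if left != right:  # ignore self-loops caused by a repeated word
                graph[left].add(right)
                graph[right].add(left)
    return dict(graph)


# Assumed token order after preprocessing (hypothetical segmentation).
sentences = [
    ["I", "like", "Wuhan", "night scene"],
    ["he", "not like", "Wuhan", "traffic"],
]
graph = build_sequence_graph(sentences)
degrees = {word: len(neighbours) for word, neighbours in graph.items()}
print(degrees)
```

Storing neighbours in sets means that a word pair occurring several times still contributes a single edge; a weighted variant would be a different design choice that the description does not discuss.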
In step S103, decomposing the network representation of the text with the k-core decomposition method from complex network theory to obtain the keywords of the text comprises:
S301: remove every node of degree 1 from the network and update the network; then remove the nodes whose degree has newly become 1 in the updated network; repeat until the network contains no node of degree 1;
S302: assign all the degree-1 nodes removed in step S301 to the 1-core group;
S303: apply the same procedure as in steps S301 and S302 to obtain, in turn, the 2-core, 3-core, ..., n-core groups of the network, where n is greater than 0 and is the maximum node degree in the network;
S304: take the words corresponding to all nodes in the (n-x)-core through n-core groups as the final keywords of the text, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirements.
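Steps S301 to S304 amount to peeling the network layer by layer, which might be sketched as follows. The names `shell_decomposition` and `select_keywords` and the toy graph are hypothetical. One interpretive assumption is made: n is taken here to be the index of the highest group that actually contains nodes, which can be smaller than the maximum node degree.

```python
from typing import Dict, List, Set


def shell_decomposition(graph: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign each node the index k of the group in which it is peeled off."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {node: set(nbrs) for node, nbrs in graph.items()}
    shell: Dict[str, int] = {}
    k = 0
    while remaining:
        # Nodes whose current degree is at most k are peeled in this round.
        peel = [node for node, nbrs in remaining.items() if len(nbrs) <= k]
        if not peel:
            k += 1  # nothing left at this level; move on to the next group
            continue
        for node in peel:
            shell[node] = k
            for nbr in remaining.pop(node):
                if nbr in remaining:
                    remaining[nbr].discard(node)
    return shell


def select_keywords(shell: Dict[str, int], x: int = 0) -> List[str]:
    """Return the words in groups (n - x) through n, n being the highest group."""
    if not shell:
        return []
    n = max(shell.values())
    return sorted(word for word, k in shell.items() if k >= n - x)


# Toy network: a triangle with a two-node tail attached to it.
graph = {
    "network": {"graph", "keyword", "text"},
    "graph": {"network", "keyword"},
    "keyword": {"network", "graph"},
    "text": {"network", "method"},
    "method": {"text"},
}
shells = shell_decomposition(graph)
print(select_keywords(shells))       # highest group only
print(select_keywords(shells, x=1))  # highest group and the one below it
```

In the toy network the tail is removed in the first group and the triangle survives to the second, so a larger x trades precision for a longer keyword list.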
Referring to Fig. 2, which is a schematic diagram of the module composition of a system for extracting text keywords based on complex networks in an embodiment of the present invention, the system comprises, connected in sequence, a preprocessing module 11, a text-to-graph module 12 and a keyword extraction module 13:
the preprocessing module 11 is configured to preprocess the text from which keywords are to be extracted, removing punctuation and stop words, to obtain the preprocessed text;
the text-to-graph module 12 is configured to convert the preprocessed text into a network representation using a text-to-graph method;
the keyword extraction module 13 is configured to decompose the network representation of the text using the k-core decomposition method from complex network theory to obtain the keywords of the text.
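The sequential connection of the three modules could be wired up along the following lines. The class `KeywordSystem` and the three stand-in functions are hypothetical; in particular, `highest_degree_words` is only a placeholder that returns the words with the most neighbours, not the k-core decomposition that module 13 is described as performing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Graph = Dict[str, Set[str]]


@dataclass
class KeywordSystem:
    """Chains the three modules in the order in which they are connected."""
    preprocess: Callable[[str], List[str]]   # preprocessing module 11
    to_graph: Callable[[List[str]], Graph]   # text-to-graph module 12
    extract: Callable[[Graph], List[str]]    # keyword extraction module 13

    def run(self, text: str) -> List[str]:
        tokens = self.preprocess(text)
        graph = self.to_graph(tokens)
        return self.extract(graph)


def split_words(text: str) -> List[str]:
    # Stand-in for module 11: lowercase whitespace tokenization only.
    return text.lower().split()


def link_neighbours(tokens: List[str]) -> Graph:
    # Stand-in for module 12: connect each word to the word that follows it.
    graph: Graph = {token: set() for token in tokens}
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def highest_degree_words(graph: Graph) -> List[str]:
    # Placeholder for module 13: NOT k-core decomposition, just the top degree.
    if not graph:
        return []
    top = max(len(nbrs) for nbrs in graph.values())
    return sorted(word for word, nbrs in graph.items() if len(nbrs) == top)


system = KeywordSystem(split_words, link_neighbours, highest_degree_words)
print(system.run("graph text graph core graph text"))
```

Passing the modules in as callables keeps each stage replaceable, so a real lemmatizer or a real k-core step can be swapped in without touching `run`.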
In the preprocessing module 11, preprocessing the text from which keywords are to be extracted involves the following units:
a lemmatization unit, configured to lemmatize the text with an NLP tool to obtain the lemmatized text;
a processing unit, configured to remove stop words and punctuation from the lemmatized text to obtain the preprocessed text.
In the lemmatization unit, the NLP tool is Stanford CoreNLP.
In the text-to-graph module 12, the text-to-graph method is the sequential graph construction method.
In the keyword extraction module 13, decomposing the network representation of the text with the k-core decomposition method from complex network theory to obtain the keywords of the text involves the following units:
a node removal unit, configured to remove every node of degree 1 from the network and update the network, and then to remove the nodes whose degree has newly become 1 in the updated network, repeating until the network contains no node of degree 1;
a node grouping unit, configured to assign all the degree-1 nodes removed in step S301 to the 1-core group;
a loop processing unit, configured to apply the same procedure as in steps S301 and S302 to obtain, in turn, the 2-core, 3-core, ..., n-core groups of the network, where n is greater than 0 and is the maximum node degree in the network;
a keyword extraction unit, configured to take the words corresponding to all nodes in the (n-x)-core through n-core groups as the final keywords of the text, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirements.
In this embodiment, text from the Oxford dictionary of English is selected as the text from which keywords are to be extracted, by way of illustration, as follows:
First, the text is tokenized and lemmatized with Stanford CoreNLP; lemmatization restores, for example, "was" to "is" and "does" to "do". Stop words and punctuation are then removed from the lemmatized text. Table 1 shows an example of text preprocessing:
Table 1. Example of text preprocessing results
Then the preprocessed text is converted into a network representation using the sequential graph construction method; Fig. 3 shows an example of a text converted into a network representation.
Finally, the k-core decomposition method from complex network theory is applied to the network representation of the text to obtain its most central part, which gives the required keywords. The specific steps are as follows:
(1) remove all nodes of degree 1 from the graph and update it; some nodes will now have a new degree of 1, and these are removed as well, until the network contains no node of degree 1; all nodes removed in this process are assigned to the 1-core group;
(2) extract the 2-core, 3-core and subsequent groups of the graph in turn with the same procedure as in (1), until no nodes remain in the network. As shown in Fig. 4, the part with the hatched background is the 1-core, the part with the diamond background is the 2-core, and the part with the dotted background is the 3-core. The words corresponding to all nodes in the 3-core form the most central part of the network and are the keywords of the text.
The beneficial effect of the present invention is that, by starting from the structure of the text and using its network representation, the proposed technical solution improves the accuracy of text keyword extraction.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A method for extracting text keywords based on complex networks, characterized by comprising the following steps:
S101: preprocess the text from which keywords are to be extracted, removing punctuation and stop words, to obtain the preprocessed text;
S102: convert the preprocessed text into a network representation using a text-to-graph method;
S103: decompose the network representation of the text using the k-core decomposition method from complex network theory to obtain the keywords of the text.
2. The method for extracting text keywords based on complex networks according to claim 1, characterized in that, in step S101, preprocessing the text from which keywords are to be extracted comprises:
S201: lemmatize the text with an NLP tool to obtain the lemmatized text;
S202: remove stop words and punctuation from the lemmatized text to obtain the preprocessed text.
3. The method for extracting text keywords based on complex networks according to claim 2, characterized in that, in step S201, the NLP tool is Stanford CoreNLP.
4. The method for extracting text keywords based on complex networks according to claim 1, characterized in that, in step S102, the text-to-graph method is the sequential graph construction method.
5. The method for extracting text keywords based on complex networks according to claim 1, characterized in that, in step S103, decomposing the network representation of the text with the k-core decomposition method from complex network theory to obtain the keywords of the text comprises:
S301: remove every node of degree 1 from the network and update the network; then remove the nodes whose degree has newly become 1 in the updated network; repeat until the network contains no node of degree 1;
S302: assign all the degree-1 nodes removed in step S301 to the 1-core group;
S303: apply the same procedure as in steps S301 and S302 to obtain, in turn, the 2-core, 3-core, ..., n-core groups of the network, where n is greater than 0 and is the maximum node degree in the network;
S304: take the words corresponding to all nodes in the (n-x)-core through n-core groups as the final keywords of the text, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirements.
6. A system for extracting text keywords based on complex networks, characterized by comprising the following modules:
a preprocessing module, configured to preprocess the text from which keywords are to be extracted, removing punctuation and stop words, to obtain the preprocessed text;
a text-to-graph module, configured to convert the preprocessed text into a network representation using a text-to-graph method;
a keyword extraction module, configured to decompose the network representation of the text using the k-core decomposition method from complex network theory to obtain the keywords of the text.
7. The system for extracting text keywords based on complex networks according to claim 6, characterized in that, in the preprocessing module, preprocessing the text from which keywords are to be extracted involves the following units:
a lemmatization unit, configured to lemmatize the text with an NLP tool to obtain the lemmatized text;
a processing unit, configured to remove stop words and punctuation from the lemmatized text to obtain the preprocessed text.
8. The system for extracting text keywords based on complex networks according to claim 7, characterized in that, in the lemmatization unit, the NLP tool is Stanford CoreNLP.
9. The system for extracting text keywords based on complex networks according to claim 6, characterized in that, in the text-to-graph module, the text-to-graph method is the sequential graph construction method.
10. The system for extracting text keywords based on complex networks according to claim 6, characterized in that, in the keyword extraction module, decomposing the network representation of the text with the k-core decomposition method from complex network theory to obtain the keywords of the text involves the following units:
a node removal unit, configured to remove every node of degree 1 from the network and update the network, and then to remove the nodes whose degree has newly become 1 in the updated network, repeating until the network contains no node of degree 1;
a node grouping unit, configured to assign all the degree-1 nodes removed in step S301 to the 1-core group;
a loop processing unit, configured to apply the same procedure as in steps S301 and S302 to obtain, in turn, the 2-core, 3-core, ..., n-core groups of the network, where n is greater than 0 and is the maximum node degree in the network;
a keyword extraction unit, configured to take the words corresponding to all nodes in the (n-x)-core through n-core groups as the final keywords of the text, where x is greater than or equal to 0 and is set according to the actual keyword extraction requirements.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910090349.1A CN109885669A (en) | 2019-01-30 | 2019-01-30 | Method and system for extracting text keywords based on complex networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910090349.1A CN109885669A (en) | 2019-01-30 | 2019-01-30 | Method and system for extracting text keywords based on complex networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109885669A true CN109885669A (en) | 2019-06-14 |
Family
ID=66927352
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910090349.1A Pending CN109885669A (en) | 2019-01-30 | 2019-01-30 | Method and system for extracting text keywords based on complex networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109885669A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460796A (en) * | 2020-03-30 | 2020-07-28 | 北京航空航天大学 | Accidental sensitive word discovery method based on word network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933032A (en) * | 2015-06-29 | 2015-09-23 | 电子科技大学 | Method for extracting keywords of blog based on complex network |
CN106844500A (en) * | 2016-12-26 | 2017-06-13 | 深圳大学 | A kind of k core truss community models and decomposition, searching algorithm |
CN107784087A (en) * | 2017-10-09 | 2018-03-09 | 东软集团股份有限公司 | A kind of hot word determines method, apparatus and equipment |
CN108763687A (en) * | 2018-05-17 | 2018-11-06 | 重庆大学 | The analysis method of public traffic network topological attribute and space attribute |
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933032A (en) * | 2015-06-29 | 2015-09-23 | 电子科技大学 | Method for extracting keywords of blog based on complex network |
CN106844500A (en) * | 2016-12-26 | 2017-06-13 | 深圳大学 | A kind of k core truss community models and decomposition, searching algorithm |
CN107784087A (en) * | 2017-10-09 | 2018-03-09 | 东软集团股份有限公司 | A kind of hot word determines method, apparatus and equipment |
CN108763687A (en) * | 2018-05-17 | 2018-11-06 | 重庆大学 | The analysis method of public traffic network topological attribute and space attribute |
Non-Patent Citations (1)
Title |
---|
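Yu Qun: "Research on Semantics-Based Feature Extraction Methods for Chinese Text", China Master's Theses Full-text Database, Information Science and Technology * |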
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103123618B (en) | Text similarity acquisition methods and device | |
CN105930509B (en) | Field concept based on statistics and template matching extracts refined method and system automatically | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
CN103778243B (en) | Domain term extraction method | |
CN106484767A (en) | A kind of event extraction method across media | |
CN107766371A (en) | A kind of text message sorting technique and its device | |
CN107273474A (en) | Autoabstract abstracting method and system based on latent semantic analysis | |
CN101013443A (en) | Intelligent word input method and input method system and updating method thereof | |
CN103106189B (en) | A kind of method and apparatus excavating synonym attribute word | |
CN103207856A (en) | Ontology concept and hierarchical relation generation method | |
CN106055623A (en) | Cross-language recommendation method and system | |
CN103324700A (en) | Noumenon concept attribute learning method based on Web information | |
CN106919557A (en) | A kind of document vector generation method of combination topic model | |
CN108563667A (en) | Hot issue acquisition system based on new word identification and its method | |
CN107357785A (en) | Theme feature word abstracting method and system, feeling polarities determination methods and system | |
CN110188359B (en) | Text entity extraction method | |
CN110674298B (en) | Deep learning mixed topic model construction method | |
CN109308317A (en) | A kind of hot spot word extracting method of the non-structured text based on cluster | |
CN106610952A (en) | Mixed text feature word extraction method | |
CN108712466A (en) | A kind of semanteme sparse Web service discovery method being embedded in based on Gaussian ATM and word | |
CN102270244B (en) | Method for quickly extracting webpage content key words based on core sentence | |
CN106570120A (en) | Process for realizing searching engine optimization through improved keyword optimization | |
Devika et al. | A semantic graph-based keyword extraction model using ranking method on big social data | |
Wang et al. | Constructing service network via classification and annotation | |
CN109885669A (en) | A kind of text key word acquisition methods and system based on complex network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190614 |