CN106156041B - Hot information discovery method and system


Info

Publication number
CN106156041B
Authority
CN
China
Prior art keywords
text
node
interdependent
processed
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510137773.9A
Other languages
Chinese (zh)
Other versions
CN106156041A (en)
Inventor
吴及
侯晋峰
胡国平
吕萍
王影
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
iFlytek Co Ltd
Original Assignee
Tsinghua University
iFlytek Co Ltd
Application filed by Tsinghua University and iFlytek Co Ltd
Priority to CN201510137773.9A
Publication of CN106156041A
Application granted
Publication of CN106156041B

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a hot information discovery method and system. The method comprises: obtaining a text to be processed; performing word segmentation and part-of-speech tagging on the text to be processed; performing syntactic analysis on the segmented text to obtain a dependency syntax tree for each sentence in the text to be processed; removing the stop words from the dependency syntax tree of each sentence to obtain the dependency syntax trees to be analyzed; constructing a small-world network from the dependency syntax trees to be analyzed; performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network; and obtaining the hot information in the text to be processed according to the hotspot analysis result. With the invention, the hot information in the text to be processed can be found efficiently and accurately.

Description

Hot information discovery method and system
Technical field
The present invention relates to the field of data mining technology, and in particular to a hot information discovery method and system.
Background art
With the rapid development of the Internet and the continuous progress of storage technology, we are surrounded by more and more text information. Much of this information is redundant, however, and reading it word by word would clearly waste a great deal of the user's time and energy. Hotspot analysis methods can quickly extract key words or sentences, i.e. hot information, from a large amount of text, so that users can conveniently and quickly grasp the important information contained in the text; hotspot analysis has therefore become a research focus. How to perform hotspot analysis on text efficiently and accurately, and find the corresponding hot information in the text to be processed, is thus the primary task of hotspot analysis.
Existing hotspot analysis methods generally build a small-world network based on word co-occurrence, compute the importance of each node in the network, and determine the hot information of the text to be processed from that importance. The importance is computed from the change in the network's average shortest path length. When building the network, existing methods generally do not consider the semantic information between words; the network is built only from the distance between adjacent words. If two words are far apart in the text but closely related semantically, existing methods cannot discover the connection. In addition, when computing the importance of each node, existing methods measure it using the shortest path alone, which is a rather limited feature. The words that existing methods rate as most important therefore do not necessarily represent the semantics of the original text. Moreover, computing the importance of each node requires recomputing all shortest paths in the network every time, which is inefficient.
Summary of the invention
Embodiments of the present invention provide a hot information discovery method and system for finding the hot information in a text to be processed efficiently and accurately.
To this end, embodiments of the present invention provide the following technical solutions:
A hot information discovery method, comprising:
Obtaining a text to be processed;
Performing word segmentation and part-of-speech tagging on the text to be processed;
Performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
Removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
Constructing a small-world network from the dependency syntax trees to be analyzed;
Performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
Obtaining the hot information in the text to be processed according to the hotspot analysis result.
Preferably, performing word segmentation and part-of-speech tagging on the text to be processed comprises:
Performing word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
Preferably, performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed comprises:
Performing dependency parsing on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.
Preferably, removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed comprises:
For the dependency syntax tree of each sentence in the text to be processed, removing the stop words according to the same principle, and connecting the nodes that remain after the stop words are removed;
Transferring all the dependency relations represented by the edges that existed before stop-word removal onto the newly generated edge, and setting the dependency relation importance of the newly generated edge to the average of the importances of all dependency relations on it.
Preferably, performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network comprises:
Computing, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, where the dependency frequency of a node is the sum of the importances of the nodes identical to that node across all dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that occur in all dependency syntax trees to be analyzed of the text to be processed, and two edges are identical when they connect the same nodes;
Computing, from the small-world network, the network-related features of each node and each edge in the small-world network, where the network-related features include the dependency degree and/or the betweenness centrality; the dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the small-world network; the dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects; and the betweenness centrality is the number of times the node or edge appears on the shortest paths between any other two nodes in the small-world network;
Computing the importance score of each node and/or edge in the small-world network from the dependency frequency and the network-related features.
Preferably, obtaining the hot information in the text to be processed according to the hotspot analysis result comprises:
Selecting, as the hot information in the text to be processed, the concatenation of the words represented by the nodes or edges whose importance score is greater than a set threshold; or
Selecting, as the hot information in the text to be processed, the concatenation of the words represented by a set number of nodes or edges taken in descending order of importance score.
A hot information discovery system, comprising:
A text obtaining module, for obtaining a text to be processed;
A preprocessing module, for performing word segmentation and part-of-speech tagging on the text to be processed;
A syntactic analysis module, for performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
A sorting module, for removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
A network construction module, for constructing a small-world network from the dependency syntax trees to be analyzed;
A hotspot analysis module, for performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
A hot information obtaining module, for obtaining the hot information in the text to be processed according to the hotspot analysis result.
Preferably, the preprocessing module performs word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
Preferably, the syntactic analysis module performs dependency parsing on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.
Preferably, the sorting module is specifically configured to: for the dependency syntax tree of each sentence in the text to be processed, remove the stop words according to the same principle and connect the nodes that remain after the stop words are removed; and transfer all the dependency relations represented by the edges that existed before stop-word removal onto the newly generated edge, setting the dependency relation importance of the newly generated edge to the average of the importances of all dependency relations on it.
Preferably, the hotspot analysis module includes a dependency frequency computing module, a feature computing module and an importance score computing module; the feature computing module includes a dependency degree computing module and/or a betweenness centrality computing module;
The dependency frequency computing module is configured to compute, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, where the dependency frequency of a node is the sum of the importances of the nodes identical to that node across all dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that occur in all dependency syntax trees to be analyzed of the text to be processed, and two edges are identical when they connect the same nodes;
The dependency degree computing module is configured to compute, from the small-world network, the dependency degree of each node and each edge in the small-world network, where the dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects;
The betweenness centrality computing module is configured to compute, from the small-world network, the betweenness centrality of each node and each edge in the small-world network, where the betweenness centrality is the number of times the node or edge appears on the shortest paths between any other two nodes in the small-world network;
The importance score computing module is configured to compute the importance score of each node and/or edge in the small-world network from the dependency frequency and the network-related features, where the network-related features include the dependency degree and/or the betweenness centrality.
Preferably, the hot information obtaining module is specifically configured to select, as the hot information in the text to be processed, the concatenation of the words represented by the nodes or edges whose importance score is greater than a set threshold; or to select, as the hot information in the text to be processed, the concatenation of the words represented by a set number of nodes or edges taken in descending order of importance score.
The hot information discovery method and system provided by the embodiments of the present invention construct the small-world network from dependency parsing, which better preserves the semantic information of the text. After the network is constructed, the network-related features are computed and ranked, hotspot analysis is performed on the ranked results, and the information related to hot words in the text to be processed is obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which speeds up the user's reading of the text and saves reading time.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some of the embodiments recorded in the present invention; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flow chart of a hot information discovery method according to an embodiment of the present invention;
Fig. 2 is dependency syntax tree example 1 before stop-word removal in an embodiment of the present invention;
Fig. 3 is dependency syntax tree example 1 after stop-word removal in an embodiment of the present invention;
Fig. 4 is dependency syntax tree example 2 in an embodiment of the present invention;
Fig. 5 is dependency syntax tree example 3 in an embodiment of the present invention;
Fig. 6 is an example of part of the small-world network constructed in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a hot information discovery system according to an embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.
Fig. 1 is a flow chart of a hot information discovery method according to an embodiment of the present invention, which includes the following steps:
Step 101, obtain the text to be processed.
Step 102, perform word segmentation and part-of-speech tagging on the text to be processed.
For example, word segmentation and part-of-speech tagging can be performed on the text to be processed using a method based on conditional random fields. Other methods can of course be used as well: word segmentation can use longest-word matching, and part-of-speech tagging can use a method based on an HMM (Hidden Markov Model).
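As a rough illustration of the longest-word matching alternative mentioned above, the following is a minimal sketch of forward maximum matching. The function name `forward_max_match`, the `max_len` parameter and the toy lexicon are assumptions made for illustration; the patent does not specify an implementation, and the sketch uses an unspaced Latin string in place of Chinese text.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-word-first segmentation (forward maximum matching)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; "smallworld" wins over the shorter "small".
lexicon = {"small", "smallworld", "world", "network", "net"}
print(forward_max_match("smallworldnetwork", lexicon, max_len=10))
```

A CRF-based segmenter, as preferred by the patent, would replace this greedy lookup with a trained sequence model.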
Step 103, perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed.
For example, dependency parsing can be performed on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.
For example, if one sentence in the text to be processed is "A small-world network is a special kind of complex network structure", its dependency syntax tree is as shown in Fig. 2. The letter abbreviations on the edges denote dependency relations, and each kind of dependency relation is assigned a different importance, as shown in Table 1 below.
Table 1:
Dependency relation                  Importance
Attributive relation (ATT)           1.0
Subject-verb relation (SBV)          1.0
Verb-object relation (VOB)           1.0
Quantity relation (QUN)              0.9
"de" (的) structure (DE)             0.5
Step 104, remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed.
Stop words are words that carry no meaning in the text to be processed, such as "this", "is" and "uh".
Stop words are removed from all dependency syntax trees in the text to be processed according to the same principle. For example, the nodes that remain after stop-word removal can be connected on the principle that the right node depends on the left node, or alternatively on the principle that the left node depends on the right node. In addition, all the dependency relations represented by the edges that existed before stop-word removal can be transferred onto the newly generated edge, whose dependency relation importance is then the average of the importances of all dependency relations on it. A single representative dependency relation importance can also be selected as the importance of the newly generated edge; the embodiments of the present invention place no limitation on this. Fig. 3 shows the dependency syntax tree of the sentence "A small-world network is a special kind of complex network structure" after stop-word removal. Before stop-word removal, the "network" node and the "structure" node are linked to the "is" node by two dependency relations, SBV and VOB (see Fig. 2). After stop-word removal, both dependency relations are transferred onto the newly generated edge, whose dependency relation importance is the average of the importances of the two relations.
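The reconnection step can be sketched as follows, using the "right node depends on left node" principle and the averaging rule described above. This is only one possible reading: the edge representation, the function names `remove_stop_word` and `edge_importance`, and the handling of a stop word that has a head of its own are assumptions, not details given in the patent.

```python
# Importance of each dependency relation, taken from Table 1.
IMPORTANCE = {"ATT": 1.0, "SBV": 1.0, "VOB": 1.0, "QUN": 0.9, "DE": 0.5}


def remove_stop_word(edges, stop):
    """Remove one stop word and reconnect its neighbours.

    edges: list of (head, dependent, relations); each word is a (position, text)
    tuple and relations is the list of relation labels carried by the edge.
    """
    kept = [e for e in edges if stop not in (e[0], e[1])]
    up = [e for e in edges if e[1] == stop]            # edge to the stop word's head, if any
    down = sorted((e for e in edges if e[0] == stop),  # its dependents, left to right
                  key=lambda e: e[1])
    if up:
        # Assumed rule: dependents of the stop word are re-attached to its head.
        head, _, head_rels = up[0]
        for _, dep, rels in down:
            kept.append((head, dep, rels + head_rels))
    elif down:
        # Stop word was the root: the leftmost dependent becomes the new head
        # ("right node depends on left node").
        _, new_head, first_rels = down[0]
        for _, dep, rels in down[1:]:
            kept.append((new_head, dep, first_rels + rels))
    return kept


def edge_importance(relations):
    """Average importance of all relations merged onto one edge."""
    return sum(IMPORTANCE[r] for r in relations) / len(relations)


# Simplified fragment of the example sentence (only four words are modelled).
small_world, network, is_, structure = (0, "small world"), (1, "network"), (2, "is"), (3, "structure")
tree = [(network, small_world, ["ATT"]), (is_, network, ["SBV"]), (is_, structure, ["VOB"])]
cleaned = remove_stop_word(tree, is_)
for head, dep, rels in cleaned:
    print(head[1], "<-", dep[1], "-".join(rels), edge_importance(rels))
```

In this fragment the new "network"-"structure" edge carries both SBV and VOB, and its importance is the average of the two Table 1 values.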
Step 105, construct the small-world network from the dependency syntax trees to be analyzed.
The small-world network is constructed from the dependency syntax trees of the sentences after stop-word removal. The detailed process is as follows:
1) Initialize an empty network G=(V, E), where V denotes the set of nodes and E denotes the set of edges;
2) Obtain, one by one, the dependency syntax tree of each sentence in the text to be processed after stop-word removal;
3) Traverse each dependency syntax tree from its root node in depth-first or breadth-first order;
4) When a node is visited, check whether it already exists in the set V; if it does, move on to the next node; otherwise add the current node to V;
5) When an edge is visited, check whether it already exists in the set E; if it does, move on to the next edge; otherwise add the current edge to E;
6) Check whether all dependency syntax trees of the text to be processed have been traversed; if so, go to step 7); otherwise go to step 2);
7) Once all dependency syntax trees in the text to be processed have been traversed, the small-world network G=(V, E) is obtained.
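Steps 1) to 7) can be sketched in a few lines. The tree representation (a root plus a mapping from each head to its dependents), the function name `build_network` and the toy trees are assumptions made for illustration; edges are treated as undirected, following the rule that two edges are the same when they connect the same nodes.

```python
def build_network(trees):
    """Merge stop-word-free dependency trees into one network G = (V, E)."""
    nodes, edges = set(), set()            # step 1: empty network
    for root, children in trees:           # steps 2 and 6: one tree per sentence
        stack = [root]                     # step 3: depth-first traversal from the root
        while stack:
            node = stack.pop()
            nodes.add(node)                # step 4: a set ignores nodes already present
            for child in children.get(node, []):
                edges.add(frozenset((node, child)))   # step 5: deduplicated edge
                stack.append(child)
    return nodes, edges                    # step 7


# Hypothetical toy trees; the real trees of Figs. 3 to 5 are not reproduced here.
tree_a = ("network", {"network": ["small world", "structure"], "structure": ["complex"]})
tree_b = ("node", {"node": ["network", "most"]})
tree_c = ("network", {"network": ["structure"]})   # repeats an existing edge
V, E = build_network([tree_a, tree_b, tree_c])
print(len(V), len(E))
```

Because words are merged by their surface form, a word shared by several sentences ("network" here) becomes a hub that links the individual trees together.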
Fig. 4 shows the dependency syntax tree, after stop-word removal, of the second sentence in the text to be processed, "In such a network, most nodes are not connected to each other". Fig. 5 shows the dependency syntax tree, after stop-word removal, of the third sentence, "but most nodes can reach one another within a few steps".
The small-world network constructed from all stop-word-free dependency syntax trees of the text to be processed is shown in Fig. 6, which depicts part of the small-world network corresponding to the text to be processed.
Step 106, perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network.
Specifically, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed can be computed from those trees, and the network-related features of each node and each edge in the small-world network can be computed from the small-world network, where the network-related features include the dependency degree and/or the betweenness centrality. The importance score of each node and/or edge in the small-world network is then computed from the dependency frequency and the network-related features.
The concepts of dependency frequency, dependency degree and betweenness centrality, and how each is computed, are described in detail below.
1) Compute the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed.
The more words depend on the current word, the higher its dependency frequency; in a dependency syntax tree, each word is represented by a node.
The dependency frequency of a node is the sum of the importances of the nodes identical to the current node across all dependency syntax trees to be analyzed of the text to be processed. The importance of a node within one tree is the square root of the number of nodes that depend on it directly or indirectly. In Fig. 3, 2 nodes depend directly on the "network" node and 4 nodes depend on it indirectly, 6 nodes in total, so the importance of the "network" node in the dependency syntax tree of Fig. 3 is the square root of 6, i.e. 2.45. Similarly, the importance of the "network" node in the dependency syntax tree of Fig. 4 is 1. If the word "network" occurs only these two times in the text to be processed, the dependency frequency of the "network" node is 2.45+1=3.45. The calculation is given by formula (1):
$$NDDeg_i = \sum_{j=1}^{V_i} \sqrt{Npro_j} \qquad (1)$$
where $NDDeg_i$ is the dependency frequency of the i-th node, $V_i$ is the number of nodes identical to the i-th node, and $Npro_j$ is the number of nodes that depend directly or indirectly on the j-th such node.
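The node dependency ("interdependent") frequency described above can be sketched as follows. The tree shapes are hypothetical and chosen only to reproduce the counts quoted in the text (6 dependents in one tree, 1 in the other); the function names and the head-to-dependents dictionary format are likewise assumptions, and the sketch assumes each word occurs at most once per tree.

```python
import math
from collections import defaultdict


def count_descendants(children, node):
    """Number of nodes that depend on `node` directly or indirectly."""
    return sum(1 + count_descendants(children, c) for c in children.get(node, []))


def node_dependency_frequency(trees):
    """Sum, over all trees, of sqrt(number of descendants) for each word."""
    freq = defaultdict(float)
    for children in trees:
        words = set(children) | {c for deps in children.values() for c in deps}
        for word in words:
            freq[word] += math.sqrt(count_descendants(children, word))
    return dict(freq)


# Hypothetical trees: "network" has 2 direct + 4 indirect dependents in tree_a
# and a single dependent in tree_b, matching the counts in the worked example.
tree_a = {"network": ["small world", "structure"],
          "structure": ["kind", "special", "complex", "form"]}
tree_b = {"node": ["network"], "network": ["such"]}
freq = node_dependency_frequency([tree_a, tree_b])
print(round(freq["network"], 2))
```

Leaf words have no dependents and therefore contribute a frequency of 0.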
The dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that occur in all dependency syntax trees to be analyzed of the text to be processed; two edges are identical when they connect the same nodes. In Fig. 3, for example, the dependency relation of the "small world-network" edge is ATT, whose importance is 1.0. If another "small world-network" edge also appears in the whole network, with dependency relation LAD and importance 0.6, the dependency frequency of the "small world-network" edge is 1.6. The calculation is given by formula (2):
$$EDDeg_k = \sum_{e=1}^{E_k} IDeg_e \qquad (2)$$
where $EDDeg_k$ is the dependency frequency of the k-th edge, $E_k$ is the number of edges identical to the k-th edge, and $IDeg_e$ is the dependency relation importance of the e-th such edge.
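The edge dependency ("interdependent") frequency is a plain sum over repeated occurrences of the same edge. A minimal sketch, assuming each occurrence is given as a `(word, word, importance)` triple and that edge direction is ignored; the function name is hypothetical:

```python
from collections import defaultdict


def edge_dependency_frequency(occurrences):
    """Sum the relation importance over every occurrence of the same edge."""
    freq = defaultdict(float)
    for a, b, importance in occurrences:
        freq[frozenset((a, b))] += importance   # same node pair means same edge
    return dict(freq)


# The two occurrences from the worked example: ATT (1.0) and LAD (0.6).
occurrences = [("small world", "network", 1.0), ("network", "small world", 0.6)]
edge_freq = edge_dependency_frequency(occurrences)
print(edge_freq[frozenset(("small world", "network"))])
```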
2) Compute the dependency degree of each node and each edge in the network from the small-world network.
The dependency degree of each node and each edge in the network is computed from the importance of each kind of dependency relation.
The dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the network. In Fig. 3, for example, the "network" node is connected to 2 edges. The dependency relation of the first edge is ATT, whose importance is 1.0; the dependency relation of the second edge is SBV-VOB, whose importance is the average of the SBV and VOB importances, i.e. 1.0. The dependency degree of the "network" node is therefore 2.0, as given by formula (3):
$$NIDeg_i = \sum_{k=1}^{N_i} IDeg_k \qquad (3)$$
where $NIDeg_i$ is the dependency degree of the i-th node, $N_i$ is the number of edges connected to the i-th node, and $IDeg_k$ is the dependency relation importance of the k-th such edge.
The dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects. In Fig. 3, the dependency degree of the "small world-network" edge is the sum of the dependency degrees of the "small world" node and the "network" node, as given by formula (4):
$$EIDeg_k = NIDeg_{i1} + NIDeg_{i2} \qquad (4)$$
where $EIDeg_k$ is the dependency degree of the k-th edge, and $NIDeg_{i1}$ and $NIDeg_{i2}$ are the dependency degrees of the two nodes i1 and i2 connected by the k-th edge.
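The node and edge dependency degree ("interdependency") can be sketched together. The network below is a three-node fragment based on the worked example; the dictionary format and function names are assumptions.

```python
from collections import defaultdict


def node_dependency_degree(edges):
    """edges: {frozenset({u, v}): relation importance}. Sums importance per node."""
    degree = defaultdict(float)
    for pair, importance in edges.items():
        for node in pair:
            degree[node] += importance
    return dict(degree)


def edge_dependency_degree(edges, node_degree):
    """Edge degree is the sum of the degrees of its two end nodes."""
    return {pair: sum(node_degree[n] for n in pair) for pair in edges}


# Fragment of Fig. 3: an ATT edge (1.0) and a merged SBV-VOB edge (average 1.0).
edges = {
    frozenset(("small world", "network")): 1.0,
    frozenset(("network", "structure")): 1.0,
}
node_deg = node_dependency_degree(edges)
edge_deg = edge_dependency_degree(edges, node_deg)
print(node_deg["network"], edge_deg[frozenset(("small world", "network"))])
```

In this fragment "small world" has only one incident edge; in the full network of Fig. 6 its degree, and therefore the edge degree, may be larger.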
3) Compute the betweenness centrality of each node or edge in the network from the small-world network.
The betweenness centrality is the number of times the node or edge appears on the shortest path between any other two nodes in the network. In Fig. 3, for example, the shortest path between the "small world" node and the "structure" node is "small world-network-structure", with length 2. The "network" node lies on this shortest path, so its betweenness centrality is 1; if the "network" node also lies on the shortest path between one other pair of nodes, its betweenness centrality is 2. The "small world-network" edge also lies on this shortest path; if it does not lie on the shortest path between any other pair of nodes, its betweenness centrality is 1. When computing shortest paths, the distance between adjacent nodes can be measured in the conventional way, i.e. as 1, or as the reciprocal of the dependency frequency of the edge between the two nodes. For example, if the dependency frequency of the "small world-network" edge in Fig. 3 is 1.6, the distance between the "small world" node and the "network" node is the reciprocal of 1.6, i.e. 0.625.
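A minimal sketch of the node count described above, assuming unit edge lengths and counting a node once for every pair of other nodes that has at least one shortest path through it. How ties between several equally short paths are handled is not stated in the patent, so that choice, like the function names, is an assumption. The weighted variant (reciprocal of the edge dependency frequency) would need Dijkstra's algorithm instead of breadth-first search and is not shown.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def node_betweenness(adj):
    """Count, for each node, the node pairs whose shortest path passes through it."""
    dist = {u: bfs_distances(adj, u) for u in adj}
    count = {u: 0 for u in adj}
    for s, t in combinations(adj, 2):
        if t not in dist[s]:
            continue                      # s and t are not connected
        for v in adj:
            if v in (s, t) or v not in dist[s] or t not in dist[v]:
                continue
            # v lies on a shortest s-t path when the two legs add up exactly.
            if dist[s][v] + dist[v][t] == dist[s][t]:
                count[v] += 1
    return count


# Three-node fragment of Fig. 3 as an undirected adjacency list.
adj = {
    "small world": ["network"],
    "network": ["small world", "structure"],
    "structure": ["network"],
}
print(node_betweenness(adj))
```

This all-pairs approach is quadratic in the number of nodes per candidate and is meant only to make the definition concrete.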
Once the dependency frequency and the dependency degree and/or betweenness centrality have been computed, these features can be combined to determine the importance score of each node and/or edge in the small-world network. Note that in practice all three features may be used together, or only the dependency frequency and the dependency degree, or only the dependency frequency and the betweenness centrality; the embodiments of the present invention place no limitation on this.
The following takes the case of using all three features together to compute the importance score of each node and/or edge in the small-world network as an example.
The three features above serve as a three-dimensional feature vector for each word node in the text to be processed. Because each feature dimension has a different value range, the features cannot be used directly, so the values of each dimension are first normalized. The normalization can use rank-based scoring, or another method such as dividing each feature value by the sum of the values in that dimension.
Taking rank-based scoring as an example: the values of each feature dimension are sorted in ascending order, and the index of a value after sorting is used as its score. For example, if the dependency frequency of the "network" node is 2.45, its dependency degree is 2.0 and its betweenness centrality is 2, and the indices after sorting are 3, 6 and 10 respectively, then the three feature scores of the node are 3, 6 and 10.
Using the normalized three-dimensional feature scores, the importance score of each node and/or edge in the network can be computed, as given by formula (5):
where $FScore_i$ is the importance score of the i-th node or edge, $Score_{ij}$ is the score of the j-th feature dimension of the i-th node or edge, and R is the number of feature dimensions of each node or edge, e.g. 3.
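The rank-based scoring and the combination step can be sketched as below. Formula (5) is not reproduced in this text, so the sketch assumes the simplest combination, an unweighted sum of the R rank scores; that choice, the tie-breaking by insertion order, the function names and the feature values are all assumptions.

```python
def rank_scores(values):
    """Map each item to its 1-based position after sorting values in ascending order."""
    ordered = sorted(values, key=lambda item: values[item])
    return {item: index for index, item in enumerate(ordered, start=1)}


def importance_scores(features):
    """features: list of R dicts {item: raw value}.

    ASSUMPTION: the combined score is the plain sum of the R rank scores.
    """
    ranked = [rank_scores(feature) for feature in features]
    return {item: sum(r[item] for r in ranked) for item in ranked[0]}


# Hypothetical raw features for three nodes.
dependency_frequency = {"network": 3.45, "node": 2.0, "structure": 1.0}
dependency_degree = {"network": 2.0, "node": 3.0, "structure": 1.0}
betweenness = {"network": 2, "node": 1, "structure": 0}
scores = importance_scores([dependency_frequency, dependency_degree, betweenness])
print(scores)
```

Ranking puts all features on the same scale regardless of their raw ranges, which is the reason the text normalizes before combining.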
Step 107, obtain the hot information in the text to be processed according to the hotspot analysis result.
Specifically, the concatenation of the words represented by the nodes or edges whose importance score is greater than a set threshold can be selected as the hot information in the text to be processed; alternatively, the concatenation of the words represented by a set number (e.g. 10) of nodes or edges, taken in descending order of importance score, can be selected. In Fig. 6, the three groups of hot information obtained are: network-node, network-structure and node-most.
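Both selection rules, by threshold or by a fixed count, can be sketched with one helper. The function name `select_hot` and the scores below are hypothetical; only the three edge labels come from the Fig. 6 example.

```python
def select_hot(scores, threshold=None, top_n=None):
    """Return items ordered by descending score, filtered by threshold and/or count."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(item, s) for item, s in ranked if s > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [item for item, _ in ranked]


# Hypothetical importance scores for four edges.
edge_scores = {"network-node": 9, "network-structure": 8, "node-most": 7, "small world-network": 4}
print(select_hot(edge_scores, top_n=3))
print(select_hot(edge_scores, threshold=7))
```

The patent describes the two rules as alternatives; the helper accepts either one, and applying both at once is an extension made here for convenience.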
The hot information discovery method of the embodiments of the present invention constructs the small-world network from dependency parsing, which better preserves the semantic information of the text. After the network is constructed, the network-related features are computed and ranked, hotspot analysis is performed on the ranked results, and the information related to hot words in the text to be processed is obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which speeds up the user's reading of the text and saves reading time.
Correspondingly, an embodiment of the present invention further provides a hot information discovery system. Fig. 7 is a schematic structural diagram of the system.
In this embodiment, the system comprises:
a text obtaining module 701, configured to obtain the text to be processed;
a preprocessing module 702, configured to perform word segmentation and part-of-speech tagging on the text to be processed;
a syntactic analysis module 703, configured to perform syntactic analysis on the segmented text to obtain a dependency syntax tree of each sentence in the text to be processed;
a sorting module 704, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
a network construction module 705, configured to construct a small-world network from the dependency syntax trees to be analyzed;
a hotspot analysis module 706, configured to perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
a hot information obtaining module 707, configured to obtain the hot information in the text to be processed according to the hotspot analysis result.
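The data flow between modules 701 to 707 may be pictured with the following minimal Python sketch. The class `HotInfoSystem` and its field names are hypothetical; each field is a pluggable callable standing in for one module, and nothing here prescribes how any module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class HotInfoSystem:
    """Wires modules 702-707 together; every field is a stand-in callable."""
    preprocess: Callable[[str], Any]                      # 702: segmentation + POS tagging
    parse: Callable[[Any], List[Any]]                     # 703: one dependency tree per sentence
    remove_stop_words: Callable[[List[Any]], List[Any]]   # 704: trees to be analyzed
    build_network: Callable[[List[Any]], Any]             # 705: small-world network
    analyze: Callable[[List[Any], Any], Any]              # 706: uses trees AND network
    select: Callable[[Any], List[str]]                    # 707: hot information

    def run(self, text: str) -> List[str]:
        # 701: `text` is the text to be processed, supplied by the caller.
        tagged = self.preprocess(text)
        trees = self.remove_stop_words(self.parse(tagged))
        network = self.build_network(trees)
        return self.select(self.analyze(trees, network))
```

The point of the sketch is that the hotspot analysis module receives both the trees to be analyzed and the network built from them, matching the description of module 706.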
The preprocessing module 702 may perform word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields. The syntactic analysis module 703 may perform dependency syntactic analysis on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed. Of course, the two modules may also use other methods for word segmentation, part-of-speech tagging, and syntactic analysis, which is not limited in the embodiments of the present invention.
It should be noted that, for the dependency syntax tree of each sentence in the text to be processed, the sorting module 704 removes the stop words according to the same principle and connects the nodes that remain after the stop words are removed. For example, the remaining nodes may be connected according to the principle that the right node depends on the left node, or according to the principle that the left node depends on the right node. In addition, the dependency relations represented by the edges that existed before the stop words were removed are all transferred to the newly generated edge. Furthermore, the dependency relation importance of the newly generated edge may be set to the average of the importance of all the dependency relations on that edge; of course, a representative dependency relation importance may instead be selected as the dependency relation importance of the newly generated edge.
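A minimal Python sketch of stop-word removal with edge merging is given below. It assumes a tree stored as parallel lists (`words`, `heads`, `weights`), and it reconnects each remaining node to its nearest remaining ancestor; that reconnection rule is an assumption chosen for brevity and is not the left/right principle named above. The averaging of the merged edges' importance follows the text. All names are hypothetical.

```python
from typing import List, Set, Tuple


def remove_stop_words(
    words: List[str],
    heads: List[int],        # heads[i] is the index of word i's head, -1 for the root
    weights: List[float],    # weights[i] is the importance of the edge i -> heads[i]
    stop_words: Set[str],
) -> List[Tuple[str, str, float]]:
    """Return (head, dependent, importance) edges over non-stop words only."""
    edges = []
    for i, word in enumerate(words):
        if word in stop_words:
            continue
        merged = [weights[i]]
        h = heads[i]
        # Climb past stop-word ancestors, collecting the edges being merged.
        while h != -1 and words[h] in stop_words:
            merged.append(weights[h])
            h = heads[h]
        if h == -1:
            continue  # the node is, or has become, a root: no incoming edge
        # The new edge carries the average importance of the merged edges.
        edges.append((words[h], word, sum(merged) / len(merged)))
    return edges


# "structure of network": "of" depends on "structure", "network" on "of".
words = ["structure", "of", "network"]
heads = [-1, 0, 1]
weights = [0.0, 0.5, 1.0]    # illustrative importance values
print(remove_stop_words(words, heads, weights, {"of"}))
```

Here the two edges around "of" collapse into a single structure-network edge whose importance is the mean of 0.5 and 1.0.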
In practical applications, the hotspot analysis module 706 may perform hotspot analysis by calculating the importance score of each node and/or edge in the small-world network. One specific structure of the module includes a dependency frequency calculation module, a feature calculation module, and an importance score calculation module; the feature calculation module includes a dependency degree calculation module and/or a betweenness centrality calculation module, wherein:
the dependency frequency calculation module is configured to calculate, according to the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed; the dependency frequency of a node is the sum of the importance of the nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed, and the dependency frequency of an edge is the sum of the dependency relation importance of all edges identical to the current edge that occur in all the dependency syntax trees to be analyzed of the text to be processed, where identical edges are edges that connect the same nodes;
the dependency degree calculation module is configured to calculate, according to the small-world network, the dependency degree of each node and each edge in the small-world network; the dependency degree of a node is the sum of the dependency relation importance of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects;
the betweenness centrality calculation module is configured to calculate, according to the small-world network, the betweenness centrality of each node and each edge in the small-world network; the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
the importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, the network-related features including the dependency degree and/or the betweenness centrality.
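The feature definitions above may be sketched in Python as follows. Several points are assumptions made only for illustration: edges are treated as undirected and keyed by their two words, the network's edge importance is taken to be the edge's dependency frequency, shortest paths are measured in hops, and the final score is a plain weighted sum. None of these choices, nor any of the names, is stated in this passage.

```python
from collections import defaultdict, deque
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str, float]   # (word, word, dependency relation importance)
Key = FrozenSet[str]            # undirected edge identified by its two words


def edge_frequency(trees: List[List[Edge]]) -> Dict[Key, float]:
    """Dependency frequency of an edge: summed importance of all edges that
    connect the same two words, over every tree to be analyzed."""
    freq: Dict[Key, float] = defaultdict(float)
    for tree in trees:
        for a, b, importance in tree:
            freq[frozenset((a, b))] += importance
    return dict(freq)


def dependency_degree(net: Dict[Key, float]) -> Tuple[Dict[str, float], Dict[Key, float]]:
    """Node degree: summed importance of incident edges.
    Edge degree: sum of the degrees of its two end nodes."""
    node_deg: Dict[str, float] = defaultdict(float)
    for edge, importance in net.items():
        for word in edge:
            node_deg[word] += importance
    edge_deg = {edge: sum(node_deg[w] for w in edge) for edge in net}
    return dict(node_deg), edge_deg


def node_betweenness(net: Dict[Key, float]) -> Dict[str, int]:
    """Count, per node, the shortest paths between two other nodes that pass
    through it (hop-count paths; edge importance is ignored here)."""
    adj: Dict[str, set] = defaultdict(set)
    for edge in net:
        a, b = tuple(edge)          # assumes no self-loops
        adj[a].add(b)
        adj[b].add(a)
    dist, paths = {}, {}
    for s in adj:                   # breadth-first search from every node
        d, c = {s: 0}, {s: 1}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v], c[v] = d[u] + 1, 0
                    queue.append(v)
                if d[v] == d[u] + 1:
                    c[v] += c[u]    # every shortest path to u extends to v
        dist[s], paths[s] = d, c
    between = {v: 0 for v in adj}
    for s, t in combinations(list(adj), 2):
        if t not in dist[s]:
            continue                # s and t are not connected
        for v in adj:
            if v in (s, t) or v not in dist[s]:
                continue
            if dist[s][v] + dist[v][t] == dist[s][t]:
                between[v] += paths[s][v] * paths[v][t]
    return between


def edge_scores(freq, edge_deg, alpha=0.5, beta=0.5):
    """Assumed linear combination; the weights are illustrative only."""
    return {e: alpha * freq[e] + beta * edge_deg[e] for e in freq}


trees = [
    [("network", "node", 1.0), ("network", "structure", 0.5)],
    [("node", "network", 1.0), ("node", "major part", 0.5)],
]
freq = edge_frequency(trees)
node_deg, edge_deg = dependency_degree(freq)  # network weighted by frequency (assumption)
between = node_betweenness(freq)
scores = edge_scores(freq, edge_deg)
```

In this toy input the network-node edge occurs in both trees, so it accumulates the largest frequency and dependency degree and therefore the highest score under the assumed weighting.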
Correspondingly, the hot information obtaining module 707 may select the phrases represented by the nodes, or by the connected edges, whose importance score is greater than a set threshold as the hot information in the text to be processed; alternatively, it may select a set number of nodes or connected edges in descending order of importance score and take the phrases they represent as the hot information in the text to be processed.
In the hot information discovery system of the embodiment of the present invention, the small-world network is constructed on the basis of dependency syntactic analysis, so the semantic information of the text is better preserved. After the network is constructed, the network-related features are calculated and sorted, hotspot analysis is performed on the sorted result, and the hotspot vocabulary related information in the text to be processed is obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which effectively increases the speed at which a user reads the text and saves reading time.
It should be noted that the hot information discovery method and system of the embodiments of the present invention can be applied to fields such as natural language processing, information search, and information processing, and can efficiently and accurately obtain the hotspot vocabulary related information that plays an important role in the text to be processed.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for the relevant details, refer to the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement the embodiment without creative effort.
The embodiments of the present invention have been described in detail above, and specific examples are used herein to explain the invention; the description of the above embodiments is only intended to help in understanding the method and system of the present invention. Meanwhile, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification shall not be construed as limiting the present invention.

Claims (12)

1. A hot information discovery method, characterized by comprising:
obtaining a text to be processed;
performing word segmentation and part-of-speech tagging on the text to be processed;
performing syntactic analysis on the segmented text to obtain a dependency syntax tree of each sentence in the text to be processed;
removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;
constructing a small-world network from the dependency syntax trees to be analyzed;
performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network, including separately calculating the dependency frequency of each element in the dependency syntax trees to be analyzed and the network-related features of each element in the small-world network, and performing the hotspot analysis according to the dependency frequency and the network-related features;
obtaining the hot information in the text to be processed according to the hotspot analysis result.
2. The method according to claim 1, characterized in that performing word segmentation and part-of-speech tagging on the text to be processed comprises:
performing word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
3. The method according to claim 1, characterized in that performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed comprises:
performing dependency syntactic analysis on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.
4. The method according to claim 1, characterized in that removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed comprises:
for the dependency syntax tree of each sentence in the text to be processed, removing the stop words according to the same principle, and connecting the nodes that remain after the stop words are removed;
transferring all the dependency relations represented by the edges that existed before the stop words were removed to the newly generated edge, and setting the corresponding dependency relation importance to the average of the importance of all the dependency relations on the newly generated edge.
5. The method according to any one of claims 1 to 4, characterized in that
calculating the dependency frequency of each element in the dependency syntax trees to be analyzed comprises:
calculating, according to the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed, wherein the dependency frequency of a node is the sum of the importance of the nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importance of all edges identical to the current edge that occur in all the dependency syntax trees to be analyzed of the text to be processed, and identical edges are edges that connect the same nodes;
calculating the network-related features of each element in the small-world network comprises:
calculating, according to the small-world network, the network-related features of each node and each edge in the small-world network, the network-related features including a dependency degree and/or a betweenness centrality, wherein the dependency degree of a node is the sum of the dependency relation importance of the edges connected to that node in the small-world network, the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects, and the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
performing the hotspot analysis according to the dependency frequency and the network-related features comprises:
calculating the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features.
6. The method according to claim 5, characterized in that obtaining the hot information in the text to be processed according to the hotspot analysis result comprises:
selecting the phrases represented by the nodes, or by the connected edges, whose importance score is greater than a set threshold as the hot information in the text to be processed; or
selecting a set number of nodes or connected edges in descending order of importance score, and taking the phrases they represent as the hot information in the text to be processed.
7. A hot information discovery system, characterized by comprising:
a text obtaining module, configured to obtain a text to be processed;
a preprocessing module, configured to perform word segmentation and part-of-speech tagging on the text to be processed;
a syntactic analysis module, configured to perform syntactic analysis on the segmented text to obtain a dependency syntax tree of each sentence in the text to be processed;
a sorting module, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;
a network construction module, configured to construct a small-world network from the dependency syntax trees to be analyzed;
a hotspot analysis module, configured to perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network, including separately calculating the dependency frequency of each element in the dependency syntax trees to be analyzed and the network-related features of each element in the small-world network, and performing the hotspot analysis according to the dependency frequency and the network-related features;
a hot information obtaining module, configured to obtain the hot information in the text to be processed according to the hotspot analysis result.
8. The system according to claim 7, characterized in that the preprocessing module performs word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
9. The system according to claim 7, characterized in that the syntactic analysis module performs dependency syntactic analysis on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.
10. The system according to claim 7, characterized in that
the sorting module is specifically configured to: for the dependency syntax tree of each sentence in the text to be processed, remove the stop words according to the same principle and connect the nodes that remain after the stop words are removed; and transfer all the dependency relations represented by the edges that existed before the stop words were removed to the newly generated edge, and set the corresponding dependency relation importance to the average of the importance of all the dependency relations on the newly generated edge.
11. The system according to any one of claims 7 to 10, characterized in that the hotspot analysis module comprises a dependency frequency calculation module, a feature calculation module, and an importance score calculation module; the feature calculation module comprises a dependency degree calculation module and/or a betweenness centrality calculation module;
the dependency frequency calculation module is configured to calculate, according to the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed, wherein the dependency frequency of a node is the sum of the importance of the nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importance of all edges identical to the current edge that occur in all the dependency syntax trees to be analyzed of the text to be processed, and identical edges are edges that connect the same nodes;
the dependency degree calculation module is configured to calculate, according to the small-world network, the dependency degree of each node and each edge in the small-world network, wherein the dependency degree of a node is the sum of the dependency relation importance of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects;
the betweenness centrality calculation module is configured to calculate, according to the small-world network, the betweenness centrality of each node and each edge in the small-world network, wherein the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
the importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, the network-related features including the dependency degree and/or the betweenness centrality.
12. The system according to claim 11, characterized in that
the hot information obtaining module is specifically configured to select the phrases represented by the nodes, or by the connected edges, whose importance score is greater than a set threshold as the hot information in the text to be processed; or to select a set number of nodes or connected edges in descending order of importance score and take the phrases they represent as the hot information in the text to be processed.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510137773.9A CN106156041B (en) 2015-03-26 2015-03-26 Hot information finds method and system

Publications (2)

Publication Number Publication Date
CN106156041A CN106156041A (en) 2016-11-23
CN106156041B true CN106156041B (en) 2019-05-28
