CN106156041B - Hot information discovery method and system - Google Patents
- Publication number: CN106156041B (application CN201510137773.9A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention discloses a hot information discovery method and system. The method comprises: obtaining text to be processed; performing word segmentation and part-of-speech tagging on the text to be processed; performing syntactic analysis on the segmented text to obtain a dependency syntax tree for each sentence in the text to be processed; removing the stop words from the dependency syntax tree of each sentence to obtain dependency syntax trees to be analyzed; constructing a small-world network from the dependency syntax trees to be analyzed; performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network; and obtaining the hot information in the text to be processed according to the hotspot analysis result. With the invention, the hot information in the text to be processed can be found efficiently and accurately.
Description
Technical field
The present invention relates to the technical field of data mining, and in particular to a hot information discovery method and system.
Background art
With the rapid development of the Internet and the continuous progress of storage technology, we are surrounded by more and more text information. Much of this information is redundant, however, and reading it item by item wastes a great deal of the user's time and energy. Hotspot analysis methods can quickly extract key words or sentences, i.e. hot information, from a large amount of text, so that users can conveniently and quickly grasp the important information the text contains. Hotspot analysis has therefore become a research focus, and its primary task is to analyze text efficiently and accurately and find the corresponding hot information in the text to be processed.
Existing hotspot analysis methods generally build a small-world network based on word co-occurrence, calculate the importance of each node in the network, and determine the hot information of the text to be processed from that importance. The importance is determined from the change in the average shortest path length of the network. When building the network, existing methods generally ignore the semantic information between words and measure the network only by the distance between adjacent words. As a result, if two words are far apart in the text but closely related semantically, existing methods cannot find the connection. In addition, existing methods measure the importance of each node using shortest paths alone, which is a rather limited feature, so the words ranked as most important do not necessarily represent the semantics of the original text. Finally, every importance calculation requires recomputing all shortest paths in the network, which is inefficient.
Summary of the invention
The embodiments of the present invention provide a hot information discovery method and system for finding the hot information in text to be processed efficiently and accurately.
To this end, the embodiments of the present invention provide the following technical solutions:
A hot information discovery method, comprising:
obtaining text to be processed;
performing word segmentation and part-of-speech tagging on the text to be processed;
performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;
constructing a small-world network from the dependency syntax trees to be analyzed;
performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
obtaining the hot information in the text to be processed according to the hotspot analysis result.
Preferably, performing word segmentation and part-of-speech tagging on the text to be processed comprises:
performing word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
Preferably, performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed comprises:
performing dependency syntactic analysis on the segmented text using a maximum spanning tree algorithm or a neural-network-based method to obtain the dependency syntax tree of each sentence in the text to be processed.
Preferably, removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed comprises:
for the dependency syntax tree of each sentence in the text to be processed, removing the stop words according to the same principle and connecting the nodes that remain after removal;
transferring all dependency relations represented by the edges that existed before stop-word removal to the newly generated edge, and setting the corresponding dependency relation importance to the average of the importances of all dependency relations on the newly generated edge.
Preferably, performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network comprises:
calculating, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, wherein the dependency frequency of a node is the sum of the importances of all nodes identical to that node across all dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that appear in all dependency syntax trees to be analyzed of the text to be processed, and two edges are identical when they connect the same nodes;
calculating, from the small-world network, the network-related features of each node and each edge in the small-world network, the network-related features comprising dependency degree and/or betweenness centrality, wherein the dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the small-world network, the dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects, and the betweenness centrality is the number of times the node or edge appears on the shortest paths between any other two nodes in the small-world network;
calculating the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features.
Preferably, obtaining the hot information in the text to be processed according to the hotspot analysis result comprises:
selecting, as the hot information in the text to be processed, the phrases represented by the nodes or by the edge connections whose importance score is greater than a set threshold; or
selecting, as the hot information in the text to be processed, the phrases represented by a set number of nodes or edge connections taken in descending order of importance score.
A hot information discovery system, comprising:
a text obtaining module, for obtaining text to be processed;
a preprocessing module, for performing word segmentation and part-of-speech tagging on the text to be processed;
a syntactic analysis module, for performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
a sorting module, for removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;
a network construction module, for constructing a small-world network from the dependency syntax trees to be analyzed;
a hotspot analysis module, for performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
a hot information obtaining module, for obtaining the hot information in the text to be processed according to the hotspot analysis result.
Preferably, the preprocessing module performs word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
Preferably, the syntactic analysis module performs dependency syntactic analysis on the segmented text using a maximum spanning tree algorithm or a neural-network-based method to obtain the dependency syntax tree of each sentence in the text to be processed.
Preferably, the sorting module is specifically configured to: for the dependency syntax tree of each sentence in the text to be processed, remove the stop words according to the same principle and connect the nodes that remain after removal; transfer all dependency relations represented by the edges that existed before stop-word removal to the newly generated edge; and set the corresponding dependency relation importance to the average of the importances of all dependency relations on the newly generated edge.
Preferably, the hotspot analysis module comprises a dependency frequency calculation module, a feature calculation module, and an importance score calculation module; the feature calculation module comprises a dependency degree calculation module and/or a betweenness centrality calculation module;
the dependency frequency calculation module is configured to calculate, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, wherein the dependency frequency of a node is the sum of the importances of all nodes identical to that node across all dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that appear in all dependency syntax trees to be analyzed of the text to be processed, and two edges are identical when they connect the same nodes;
the dependency degree calculation module is configured to calculate, from the small-world network, the dependency degree of each node and each edge in the small-world network, wherein the dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects;
the betweenness centrality calculation module is configured to calculate, from the small-world network, the betweenness centrality of each node and each edge in the small-world network, wherein the betweenness centrality is the number of times the node or edge appears on the shortest paths between any other two nodes in the small-world network;
the importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, the network-related features comprising the dependency degree and/or the betweenness centrality.
Preferably, the hot information obtaining module is specifically configured to select, as the hot information in the text to be processed, the phrases represented by the nodes or by the edge connections whose importance score is greater than a set threshold; or to select, as the hot information in the text to be processed, the phrases represented by a set number of nodes or edge connections taken in descending order of importance score.
The hot information discovery method and system provided by the embodiments of the present invention build the small-world network from dependency syntactic analysis, which better preserves the semantic information of the text. Once the network is built, the network-related features are calculated and ranked, hotspot analysis is performed on the ranked results, and the hot words and related information in the text to be processed are obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which speeds up the user's reading and saves reading time.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for the embodiments are briefly introduced below. The drawings described below are evidently only some of the embodiments recorded in the present invention, and those of ordinary skill in the art can obtain other drawings based on them.
Fig. 1 is a flow chart of a hot information discovery method according to an embodiment of the present invention;
Fig. 2 is dependency syntax tree example one before stop-word removal in an embodiment of the present invention;
Fig. 3 is dependency syntax tree example one after stop-word removal in an embodiment of the present invention;
Fig. 4 is dependency syntax tree example two in an embodiment of the present invention;
Fig. 5 is dependency syntax tree example three in an embodiment of the present invention;
Fig. 6 is an example of part of the small-world network constructed in an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of a hot information discovery system according to an embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.
Fig. 1 is a flow chart of a hot information discovery method according to an embodiment of the present invention, comprising the following steps:
Step 101: obtain the text to be processed.
Step 102: perform word segmentation and part-of-speech tagging on the text to be processed.
For example, word segmentation and part-of-speech tagging may be performed on the text to be processed using a method based on conditional random fields. Other methods may of course also be used: word segmentation may use longest-word matching, and part-of-speech tagging may use a method based on an HMM (Hidden Markov Model).
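The longest-word matching alternative mentioned in step 102 can be illustrated with forward maximum matching. This is only a minimal sketch and not the patent's implementation: the function name `forward_max_match`, the `max_len` parameter, and the toy vocabulary are assumptions made for illustration.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    substring found in the vocabulary, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical vocabulary for the example sentence fragment "小世界网络".
example_words = forward_max_match("小世界网络", {"小世界", "网络", "世界"})
```

A statistical tagger such as a CRF or HMM would normally replace this dictionary lookup when ambiguity matters; the sketch only shows the matching principle.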
Step 103: perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed.
For example, dependency syntactic analysis may be performed on the segmented text using a maximum spanning tree algorithm or a neural-network-based method to obtain the dependency syntax tree of each sentence in the text to be processed.
For example, if one sentence in the text to be processed is "A small-world network is a special kind of complex network structure", its dependency syntax tree is as shown in Fig. 2. The letter abbreviations on the edges denote dependency relations, and each kind of dependency relation is assigned a different importance, as shown in Table 1 below.
Table 1:
Dependency relation | Importance |
Attributive (modifier-head) relation ATT | 1.0 |
Subject-verb relation SBV | 1.0 |
Verb-object relation VOB | 1.0 |
Quantity relation QUN | 0.9 |
"de" (的) structure DE | 0.5 |
Step 104: remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed.
Stop words are words that carry no meaning in the text to be processed, such as "this", "is", and "uh".
When stop words are removed, the same principle is applied to all dependency syntax trees in the text to be processed. For example, the nodes that remain after a stop word is removed may be connected on the principle that the right node depends on the left node, or alternatively on the principle that the left node depends on the right node. In addition, the dependency relations represented by the edges that existed before the stop word was removed may all be transferred to the newly generated edge, whose dependency relation importance is then the average of the importances of all dependency relations on that edge. A single representative dependency relation importance may also be chosen as the importance of the newly generated edge; the embodiments of the present invention place no limit on this. Fig. 3 shows the dependency syntax tree of the sentence "A small-world network is a special kind of complex network structure" after stop-word removal. Before removal, the "network" node and the "structure" node each have a dependency relation with the "is" node, namely SBV and VOB respectively (see Fig. 2). After removal, both relations are transferred to the newly generated edge, and the dependency relation importance of that edge is the average of the two importances.
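The stop-word removal in step 104 can be sketched as follows, using the "right node depends on left node" principle and the importances of Table 1. The edge representation `(head, dependent, relations)`, the `order` mapping of words to sentence positions, and both function names are assumptions made for this illustration; the patent does not prescribe a data structure.

```python
# Importances copied from Table 1.
IMPORTANCE = {"ATT": 1.0, "SBV": 1.0, "VOB": 1.0, "QUN": 0.9, "DE": 0.5}


def remove_stop_word(edges, stop_word, order):
    """edges: list of (head, dependent, relations), relations being a tuple of labels.
    order: dict mapping each word to its position in the sentence.
    Returns the edge list with the stop word removed and its neighbours reconnected."""
    kept, neighbours = [], []
    for head, dep, rels in edges:
        if stop_word in (head, dep):
            other = dep if head == stop_word else head
            neighbours.append((other, rels))
        else:
            kept.append((head, dep, rels))
    # Right node depends on left node: link neighbours in sentence order,
    # carrying over every relation label from the removed edges.
    neighbours.sort(key=lambda item: order[item[0]])
    for (left, rels_left), (right, rels_right) in zip(neighbours, neighbours[1:]):
        kept.append((left, right, rels_left + rels_right))
    return kept


def edge_importance(relations):
    """Average importance of all dependency relations carried by one edge."""
    return sum(IMPORTANCE[r] for r in relations) / len(relations)
```

For the example of Fig. 2, removing "is" leaves a single "network" to "structure" edge labelled `("SBV", "VOB")`, whose averaged importance is 1.0.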
Step 105: construct a small-world network from the dependency syntax trees to be analyzed.
The small-world network is constructed from the dependency syntax tree of each sentence after stop-word removal. The detailed process is as follows:
1) initialize an empty network G = (V, E), where V is the set of nodes and E is the set of edges;
2) obtain, one at a time, the dependency syntax tree of each sentence in the text to be processed after stop-word removal;
3) traverse each dependency syntax tree from the root node, depth-first or breadth-first;
4) on reaching a node, check whether the current node already exists in V; if it does, continue to the next node; if it does not, add the current node to V;
5) on reaching an edge, check whether the current edge already exists in E; if it does, continue to the next edge; if it does not, add the current edge to E;
6) check whether all dependency syntax trees of the text to be processed have been traversed; if so, go to step 7), otherwise go to step 2);
7) all dependency syntax trees in the text to be processed have been traversed, and the small-world network G = (V, E) is obtained.
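Steps 1) to 7) above can be sketched in a few lines. The tree representation `(word, [children])`, the directed `(head, dependent)` edge tuples, and the name `build_network` are assumptions for illustration only; Python sets provide the "add only if absent" checks of steps 4) and 5).

```python
def build_network(trees):
    """trees: iterable of dependency trees, each a nested tuple (word, [children]).
    Returns the node set V and edge set E of the merged network."""
    nodes, edges = set(), set()            # step 1: empty network G = (V, E)
    for root in trees:                     # steps 2 and 6: one tree at a time
        stack = [root]                     # step 3: depth-first traversal
        while stack:
            word, children = stack.pop()
            nodes.add(word)                # step 4: set membership handles duplicates
            for child in children:
                edges.add((word, child[0]))  # step 5: edge = (head, dependent)
                stack.append(child)
    return nodes, edges                    # step 7: the network G = (V, E)
```

Because identical words from different sentences collapse into one node, the trees of separate sentences become connected through their shared vocabulary.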
Fig. 4 shows the dependency syntax tree, after stop-word removal, of the second sentence in the text to be processed, "In such networks, most nodes are not connected to each other". Fig. 5 shows the dependency syntax tree, after stop-word removal, of the third sentence, "but most nodes can reach one another in just a few steps".
The small-world network built from all dependency syntax trees of the text to be processed after stop-word removal is shown in Fig. 6, which depicts part of the small-world network corresponding to the text to be processed.
Step 106: perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network.
Specifically, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed may be calculated from those trees, and the network-related features of each node and each edge in the small-world network may be calculated from the network, the network-related features comprising dependency degree and/or betweenness centrality. The importance score of each node and/or edge in the small-world network is then calculated from the dependency frequency and the network-related features.
The concepts of dependency frequency, dependency degree, and betweenness centrality, and the way each is calculated, are described in detail below.
1) The dependency frequency of each node and each edge in the dependency syntax trees to be analyzed is calculated from those trees.
The more phrases depend on the current word, the higher the dependency frequency of that word; in a dependency syntax tree, phrases are represented by nodes.
The dependency frequency of a node is the sum of the importances of all nodes identical to the current node across all dependency syntax trees to be analyzed of the text to be processed. The importance of a node within one tree is the square root of the number of nodes that depend on it directly or indirectly. In Fig. 3, 2 nodes depend directly on the "network" node and 4 depend on it indirectly, 6 in total, so the importance of the "network" node in the tree of Fig. 3 is the square root of 6, i.e. 2.45. Similarly, the importance of the "network" node in the tree of Fig. 4 is 1. If the word "network" appears only these two times in the text to be processed, the dependency frequency of the "network" node is 2.45 + 1 = 3.45. The calculation is given by formula (1):
$$NDDeg_i = \sum_{j=1}^{V_i} \sqrt{Npro_j} \qquad (1)$$
where NDDeg_i is the dependency frequency of the i-th node, V_i is the number of nodes identical to the i-th node, and Npro_j is the number of nodes that depend directly or indirectly on the j-th such node.
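The node dependency frequency can be checked numerically with a short sketch. The `(word, [children])` tree encoding and the function names are assumptions, and the two toy trees below are hypothetical: they only mimic the counts quoted for Figs. 3 and 4 (6 dependents and 1 dependent), not the actual figures.

```python
import math
from collections import defaultdict


def count_descendants(node):
    """Number of nodes that depend on this node directly or indirectly."""
    _, children = node
    return sum(1 + count_descendants(child) for child in children)


def node_dependency_frequency(trees):
    """Sum, over every occurrence of a word in every tree, of sqrt(#descendants)."""
    frequency = defaultdict(float)

    def visit(node):
        word, children = node
        frequency[word] += math.sqrt(count_descendants(node))
        for child in children:
            visit(child)

    for root in trees:
        visit(root)
    return dict(frequency)


# Hypothetical trees: "network" has 6 dependents in the first and 1 in the second.
tree_a = ("network", [("a", [("c", []), ("d", [])]), ("b", [("e", []), ("f", [])])])
tree_b = ("x", [("network", [("g", [])])])
frequencies = node_dependency_frequency([tree_a, tree_b])
```

With these inputs `frequencies["network"]` equals the square root of 6 plus 1, which rounds to the 3.45 given in the text; leaf nodes contribute 0.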
The dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that appear in all dependency syntax trees to be analyzed of the text to be processed, where two edges are identical when they connect the same nodes. For example, in Fig. 3 the dependency relation of the "small world - network" edge is ATT, with importance 1.0. If another "small world - network" edge appears elsewhere in the network with dependency relation LAD and importance 0.6, the dependency frequency of the "small world - network" edge is 1.6. The calculation is given by formula (2):
$$EDDeg_k = \sum_{e=1}^{E_k} IDeg_e \qquad (2)$$
where EDDeg_k is the dependency frequency of the k-th edge, E_k is the number of edges identical to the k-th edge, and IDeg_e is the dependency relation importance of the e-th such edge.
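The edge dependency frequency is a grouped sum, sketched below. The input format, a list of `((node_a, node_b), importance)` pairs collected from all trees, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def edge_dependency_frequency(edge_occurrences):
    """edge_occurrences: iterable of ((node_a, node_b), importance), one entry per
    occurrence of an edge in any dependency tree. Identical edges are summed."""
    frequency = defaultdict(float)
    for endpoints, importance in edge_occurrences:
        frequency[endpoints] += importance
    return dict(frequency)


# The example from the text: one ATT occurrence (1.0) and one LAD occurrence (0.6).
occurrences = [(("small world", "network"), 1.0), (("small world", "network"), 0.6)]
edge_frequency = edge_dependency_frequency(occurrences)
```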
2) The dependency degree of each node and each edge in the network is calculated from the small-world network.
The dependency degree of each node and each edge in the network is calculated from the importance of each kind of dependency relation.
The dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the network. In Fig. 3, for example, the "network" node is connected to 2 edges. The dependency relation of the first edge is ATT, with importance 1.0; the dependency relation of the second edge is SBV-VOB, whose importance is the average of the SBV and VOB importances, i.e. 1.0. The dependency degree of the "network" node is therefore 2.0, as given by formula (3):
$$NIDeg_i = \sum_{k=1}^{N_i} IDeg_k \qquad (3)$$
where NIDeg_i is the dependency degree of the i-th node, N_i is the number of edges connected to the i-th node, and IDeg_k is the dependency relation importance of the k-th such edge.
The dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects. In Fig. 3, the dependency degree of the "small world - network" edge is the sum of the dependency degrees of the "small world" node and the "network" node, as given by formula (4):
$$EIDeg_k = NIDeg_{i1} + NIDeg_{i2} \qquad (4)$$
where EIDeg_k is the dependency degree of the k-th edge, and NIDeg_{i1} and NIDeg_{i2} are the dependency degrees of the two nodes i1 and i2 connected by the k-th edge.
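Formulas (3) and (4) can be sketched together. The dictionary format `{(node_a, node_b): importance}` and the function names are assumptions for illustration; the two edges below reproduce the "network" node example from Fig. 3 as described in the text.

```python
from collections import defaultdict


def node_dependency_degree(edge_importance):
    """Formula (3): sum the importances of all edges touching each node."""
    degree = defaultdict(float)
    for (node_a, node_b), importance in edge_importance.items():
        degree[node_a] += importance
        degree[node_b] += importance
    return dict(degree)


def edge_dependency_degree(edge_importance, node_degree):
    """Formula (4): an edge's degree is the sum of its two endpoint degrees."""
    return {(a, b): node_degree[a] + node_degree[b] for (a, b) in edge_importance}


network_edges = {("small world", "network"): 1.0,   # ATT
                 ("network", "structure"): 1.0}      # SBV-VOB, averaged importance
node_degree = node_dependency_degree(network_edges)
edge_degree = edge_dependency_degree(network_edges, node_degree)
```

In this two-edge fragment the "network" node has dependency degree 2.0, matching the text, and the "small world - network" edge has 1.0 + 2.0 = 3.0.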
3) The betweenness centrality of each node or edge in the network is calculated from the small-world network.
The betweenness centrality is the number of times a node or edge appears on the shortest paths between any other two nodes in the network. In Fig. 3, for example, the shortest path between the "small world" node and the "structure" node is "small world - network - structure", with length 2. The "network" node appears on this shortest path, so its betweenness centrality is 1; if the "network" node also appears on the shortest path between another pair of nodes, its betweenness centrality is 2. The "small world - network" edge also appears on this shortest path; if it appears on no shortest path between other nodes, its betweenness centrality is 1. When shortest paths are calculated, the distance between adjacent nodes may be measured in the conventional way, i.e. as 1, or as the reciprocal of the dependency frequency of the edge between the two nodes. If the dependency frequency of the "small world - network" edge in Fig. 3 is assumed to be 1.6, the distance between the "small world" node and the "network" node is the reciprocal of 1.6, i.e. 0.625.
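The counting described above can be sketched with Dijkstra's algorithm. This is one possible reading of the text, not the patent's procedure: it follows a single shortest path per node pair, treats edges as undirected, and skips one-edge paths when counting edges (so that an edge is only credited for paths between other nodes, as in the example). The adjacency format and function names are assumptions.

```python
import heapq
from itertools import combinations


def shortest_path(adj, src, dst):
    """Dijkstra over adj = {node: {neighbour: distance}}; returns one shortest
    path as a list of nodes, or None if dst is unreachable."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def betweenness(adj):
    """Count how often each node and edge lies on a shortest path between two
    other nodes (one path per unordered pair)."""
    node_count = {n: 0 for n in adj}
    edge_count = {}
    for a, b in combinations(sorted(adj), 2):
        path = shortest_path(adj, a, b)
        if path is None or len(path) < 3:
            continue  # no intermediate node or edge to credit
        for n in path[1:-1]:
            node_count[n] += 1
        for u, v in zip(path, path[1:]):
            key = tuple(sorted((u, v)))
            edge_count[key] = edge_count.get(key, 0) + 1
    return node_count, edge_count
```

To use the reciprocal-frequency distance from the text, the caller would set each entry of `adj` to `1 / dependency_frequency`, for example 0.625 for an edge with dependency frequency 1.6.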
Once the dependency frequency and the dependency degree and/or betweenness centrality have been calculated, these features can be combined to determine the importance score of each node and/or edge in the small-world network. It should be noted that, in practical applications, the importance score of each node and/or edge in the small-world network may be calculated using all three features together, using the dependency frequency and the dependency degree, or using the dependency frequency and the betweenness centrality; the embodiments of the present invention place no limit on this.
The following description takes as an example the calculation of the importance score of each node and/or edge in the small-world network using all three features together.
The three features above are taken as the three-dimensional feature of each word node in the text to be processed. Because the value ranges of the feature dimensions differ, the values cannot be used directly, so the values of each dimension are first normalized. Normalization may use rank scoring, or another method such as dividing each feature value in a dimension by the sum of the feature values in that dimension.
Taking rank scoring as an example, the feature values of each dimension are sorted in ascending order, and the index of a feature value after sorting is used as its score. For example, if the dependency frequency of the "network" node is 2.45, its dependency degree is 2.0, and its betweenness centrality is 2, and their indices after sorting are 3, 6, and 10 respectively, then the three-dimensional feature scores of the node are 3, 6, and 10.
Using the normalized three-dimensional feature scores, the importance score of each node and/or edge in the network can be calculated, as shown in formula (5):
where FScore_i is the importance score of the i-th node or edge, Score_ij is the score of the j-th feature dimension of the i-th node or edge, and R is the number of feature dimensions of each node or edge, for example 3.
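Rank scoring and the combination of the feature scores can be sketched as follows. Two points are assumptions rather than statements from the text: ranks are taken as 1-based with ties sharing a rank, and the per-dimension scores are combined by a plain sum over the R dimensions, since formula (5) is not reproduced here. The function names and sample values are hypothetical.

```python
def rank_scores(values):
    """values: {item: raw feature value}. Returns {item: rank}, where rank is the
    1-based position of the value among the distinct values sorted ascending."""
    ordered = sorted(set(values.values()))
    rank = {value: index + 1 for index, value in enumerate(ordered)}
    return {item: rank[value] for item, value in values.items()}


def importance_scores(features):
    """features: list of {item: raw value}, one dict per feature dimension.
    ASSUMPTION: the final score is the sum of the rank scores over all dimensions."""
    ranked = [rank_scores(feature) for feature in features]
    return {item: sum(r[item] for r in ranked) for item in ranked[0]}


# Hypothetical raw features for three nodes.
dependency_frequency = {"network": 3.45, "node": 1.0, "structure": 0.0}
dependency_degree = {"network": 2.0, "node": 3.0, "structure": 1.0}
betweenness_centrality = {"network": 2, "node": 1, "structure": 0}
scores = importance_scores([dependency_frequency, dependency_degree, betweenness_centrality])
```

Rank scoring removes the scale differences between dimensions, so any monotone combination (sum, mean, weighted sum) can be substituted in `importance_scores` without changing the normalization step.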
Step 107: obtain the hot information in the text to be processed according to the hotspot analysis result.
Specifically, the phrases represented by the nodes or by the edge connections whose importance score is greater than a set threshold may be selected as the hot information in the text to be processed; or the phrases represented by a set number (for example 10) of nodes or edge connections, taken in descending order of importance score, may be selected as the hot information in the text to be processed. In Fig. 6, three groups of hot information are obtained: network-node, network-structure, and node-most.
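The two selection rules of step 107 can be sketched as one helper. The function name `select_hot`, its parameters, and the sample scores are assumptions for illustration; the scores are not values from the patent.

```python
def select_hot(scores, threshold=None, top_n=None):
    """scores: {phrase: importance score}. Returns phrases in descending score
    order, filtered by threshold if given, otherwise truncated to top_n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if threshold is not None:
        return [phrase for phrase in ranked if scores[phrase] > threshold]
    return ranked[:top_n]


# Hypothetical importance scores for four edge phrases.
sample_scores = {"network-node": 8, "network-structure": 7, "node-most": 6, "step-few": 2}
by_threshold = select_hot(sample_scores, threshold=5)
by_count = select_hot(sample_scores, top_n=2)
```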
The hot information discovery method of the embodiments of the present invention builds the small-world network from dependency syntactic analysis, which better preserves the semantic information of the text. Once the network is built, the network-related features are calculated and ranked, hotspot analysis is performed on the ranked results, and the hot words and related information in the text to be processed are obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which speeds up the user's reading and saves reading time.
Correspondingly, the embodiments of the present invention also provide a hot information discovery system; Fig. 7 is a structural schematic diagram of the system.
In this embodiment, the system comprises:
a text obtaining module 701, for obtaining text to be processed;
a preprocessing module 702, for performing word segmentation and part-of-speech tagging on the text to be processed;
a syntactic analysis module 703, for performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
a sorting module 704, for removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;
a network construction module 705, for constructing a small-world network from the dependency syntax trees to be analyzed;
a hotspot analysis module 706, for performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
a hot information obtaining module 707, for obtaining the hot information in the text to be processed according to the hotspot analysis result.
The preprocessing module 702 may perform word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields. The syntactic analysis module 703 may perform dependency parsing on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed. Of course, these two modules may also use other methods for word segmentation, part-of-speech tagging, and syntactic analysis; the embodiments of the present invention place no limitation on this.
It should be noted that, for the dependency syntax tree of each sentence in the text to be processed, the sorting module 704 removes the stop words according to the same principle and connects the nodes that remain after the stop words are removed. For example, the remaining nodes may be connected according to the principle that the right node depends on the left node, or according to the principle that the left node depends on the right node. In addition, the dependency relations represented by each edge before the stop words were removed are all transferred to the newly generated edge. Furthermore, the dependency-relation importance of the newly generated edge may be set to the average of the importance of all dependency relations on that edge; alternatively, a representative dependency-relation importance may be selected as the dependency-relation importance of the newly generated edge.
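The stop-word removal just described can be illustrated with a minimal sketch. Everything concrete here is an assumption made for the example: the `Arc` record, the relation labels, the rule that a word whose head chain runs through stop words is reconnected to its nearest remaining ancestor, and the choice of the "right node depends on the left node" principle together with averaging for the newly generated edge.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Arc:
    head: Optional[int]   # sentence index of the head word, None for the root
    relation: str         # dependency label (illustrative values only)
    importance: float     # dependency-relation importance


def remove_stop_words(words, arcs, stop_words):
    """Return {(governor_idx, dependent_idx): (relations, importance)} over the remaining words."""
    new_edges = {}
    for i, word in enumerate(words):
        if word in stop_words:
            continue
        arc = arcs[i]
        relations, weights, head = [arc.relation], [arc.importance], arc.head
        # Climb through removed stop words, collecting the relations of the collapsed edges.
        while head is not None and words[head] in stop_words:
            arc = arcs[head]
            relations.append(arc.relation)
            weights.append(arc.importance)
            head = arc.head
        if head is None:
            continue  # the word is, or has become, a root
        if len(relations) == 1:
            new_edges[(head, i)] = (relations, weights[0])   # original edge, kept unchanged
        else:
            governor, dependent = sorted((head, i))          # right node depends on left node
            new_edges[(governor, dependent)] = (relations, mean(weights))
    return new_edges
```

For the toy sentence `["report", "on", "floods", "arrived"]` with "on" as a stop word, the two edges through "on" collapse into a single edge between "report" and "floods" that carries both relations and their averaged importance.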
In practical applications, the hotspot analysis module 706 may perform hotspot analysis by calculating the importance score of each node and/or edge in the small-world network. One specific structure of the module includes a dependency frequency calculation module, a feature calculation module, and an importance score calculation module; the feature calculation module includes a dependency degree calculation module and/or a betweenness centrality calculation module. Specifically:
The dependency frequency calculation module is configured to calculate, according to the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees. The dependency frequency of a node is the sum of the importance of all nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed. The dependency frequency of an edge is the sum of the dependency-relation importance of all edges identical to the current edge that appear in all the dependency syntax trees to be analyzed of the text to be processed; two edges are identical when they connect the same nodes;
The dependency degree calculation module is configured to calculate, according to the small-world network, the dependency degree of each node and each edge in the small-world network. The dependency degree of a node is the sum of the dependency-relation importance of the edges connected to that node in the small-world network; the dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects;
The betweenness centrality calculation module is configured to calculate, according to the small-world network, the betweenness centrality of each node and each edge in the small-world network. Betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
The importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, where the network-related features include the dependency degree and/or the betweenness centrality.
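The four calculations above can be sketched for nodes as follows. This is a hedged illustration rather than the formula of the embodiment: the input layouts, the undirected treatment of edges, the hop-count shortest paths, the restriction of betweenness to nodes, and in particular the linear combination and its `weights` in `importance_scores` are all assumptions, since this section does not fix how the features are combined.

```python
from collections import defaultdict, deque
from itertools import combinations


def edge_key(u, v):
    """Edges are identical when they connect the same two nodes (treated as undirected here)."""
    return tuple(sorted((u, v)))


def dependency_frequency(trees):
    """trees: list of (node_importance, edge_importance) dict pairs, one pair per sentence."""
    node_freq, edge_freq = defaultdict(float), defaultdict(float)
    for node_importance, edge_importance in trees:
        for node, w in node_importance.items():
            node_freq[node] += w
        for (u, v), w in edge_importance.items():
            edge_freq[edge_key(u, v)] += w
    return dict(node_freq), dict(edge_freq)


def dependency_degree(network):
    """network: {(u, v): dependency-relation importance} for the small-world network."""
    node_deg = defaultdict(float)
    for (u, v), w in network.items():
        node_deg[u] += w
        node_deg[v] += w
    edge_deg = {edge_key(u, v): node_deg[u] + node_deg[v] for (u, v) in network}
    return dict(node_deg), edge_deg


def node_betweenness(network):
    """For each node, count the shortest paths between pairs of other nodes that pass through it."""
    adj = defaultdict(set)
    for u, v in network:
        adj[u].add(v)
        adj[v].add(u)
    dist, paths = {}, {}
    for s in adj:  # BFS from every node: hop distances and numbers of shortest paths
        d, p, queue = {s: 0}, {s: 1}, deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in d:
                    d[y], p[y] = d[x] + 1, 0
                    queue.append(y)
                if d[y] == d[x] + 1:
                    p[y] += p[x]
        dist[s], paths[s] = d, p
    score = {n: 0 for n in adj}
    for s, t in combinations(adj, 2):
        if t not in dist[s]:
            continue  # s and t lie in different components
        for n in adj:
            if n in (s, t) or n not in dist[s]:
                continue
            if dist[s][n] + dist[n][t] == dist[s][t]:
                score[n] += paths[s][n] * paths[n][t]
    return score


def importance_scores(freq, degree, betweenness, weights=(1.0, 1.0, 1.0)):
    """Hypothetical linear combination of the three node features."""
    a, b, c = weights
    return {n: a * freq.get(n, 0.0) + b * degree.get(n, 0.0) + c * betweenness.get(n, 0)
            for n in degree}
```

Edge scores could be obtained in the same way from the edge-level frequency and dependency degree; a library routine for betweenness centrality could replace `node_betweenness` if normalized values are acceptable.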
Correspondingly, the hotspot information acquisition module 707 may select, as the hotspot information in the text to be processed, the phrases represented by nodes whose importance score is greater than a set threshold, or by the nodes connected by edges whose importance score is greater than the set threshold; alternatively, it may select, in descending order of importance score, a set number of nodes, or the nodes connected by a set number of edges, and take the phrases they represent as the hotspot information in the text to be processed.
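Both selection strategies, by threshold and by a set number of top-ranked elements, can be sketched in a few lines. The function name `select_hotspots`, its parameters, and the convention that an edge is keyed by a tuple of its two node phrases are assumptions made for this example.

```python
def select_hotspots(scores, threshold=None, top_k=None):
    """scores: {node_or_edge: importance score}; an edge key is a tuple of its two nodes."""
    if (threshold is None) == (top_k is None):
        raise ValueError("give exactly one of threshold or top_k")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        chosen = [key for key, score in ranked if score > threshold]
    else:
        chosen = [key for key, _ in ranked[:top_k]]
    phrases = []
    for key in chosen:
        # A node contributes its own phrase; an edge contributes the phrases of both its nodes.
        for word in (key if isinstance(key, tuple) else (key,)):
            if word not in phrases:
                phrases.append(word)
    return phrases
```

With a threshold, the size of the result depends on the score distribution of the text; with `top_k`, the result size is bounded regardless of the text.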
The hotspot information discovery system of the embodiment of the present invention constructs the small-world network on the basis of dependency parsing, which better preserves the semantic information of the text. After the network is constructed, the network-related features are calculated and sorted, hotspot analysis is performed on the sorted results, and the hotspot-word information in the text to be processed is obtained from the hotspot analysis result. The hotspot information of the text can thus be analyzed efficiently and accurately, which speeds up reading and saves the user's reading time.
It should be noted that the hotspot information discovery method and system of the embodiments of the present invention can be applied to fields such as natural language processing, information search, and information processing, and can efficiently and accurately obtain information on the hotspot words that play an important role in the text to be processed.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described briefly; for related details, refer to the description of the method embodiment. The system embodiment described above is merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The embodiments of the present invention have been described in detail above. Specific embodiments are used herein to explain the invention, and the description of the above embodiments is only intended to help in understanding the method and system of the invention. Meanwhile, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.
Claims (12)
1. A hotspot information discovery method, characterized by comprising:
obtaining a text to be processed;
performing word segmentation and part-of-speech tagging on the text to be processed;
performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
removing the stop words from the dependency syntax tree of each sentence in the text to be processed, to obtain dependency syntax trees to be analyzed;
constructing a small-world network from the dependency syntax trees to be analyzed;
performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network, including separately calculating the dependency frequency of each element in the dependency syntax trees to be analyzed and the network-related features of each element in the small-world network, and performing hotspot analysis according to the dependency frequency and the network-related features;
obtaining the hotspot information in the text to be processed according to the hotspot analysis result.
2. The method according to claim 1, characterized in that performing word segmentation and part-of-speech tagging on the text to be processed comprises:
performing word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
3. The method according to claim 1, characterized in that performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed comprises:
performing dependency parsing on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.
4. The method according to claim 1, characterized in that removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed comprises:
for the dependency syntax tree of each sentence in the text to be processed, removing the stop words according to the same principle, and connecting the nodes that remain after the stop words are removed;
transferring all the dependency relations represented by each edge before the stop words were removed to the newly generated edge, and setting the corresponding dependency-relation importance to the average of the importance of all dependency relations on the newly generated edge.
5. The method according to any one of claims 1 to 4, characterized in that
calculating the dependency frequency of each element in the dependency syntax trees to be analyzed comprises:
calculating, according to the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed, where the dependency frequency of a node is the sum of the importance of all nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency-relation importance of all edges identical to the current edge that appear in all the dependency syntax trees to be analyzed of the text to be processed, and identical edges are edges that connect the same nodes;
calculating the network-related features of each element in the small-world network comprises:
calculating, according to the small-world network, the network-related features of each node and each edge in the small-world network, where the network-related features include the dependency degree and/or the betweenness centrality, the dependency degree of a node is the sum of the dependency-relation importance of the edges connected to that node in the small-world network, the dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects, and the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
performing hotspot analysis according to the dependency frequency and the network-related features comprises:
calculating the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features.
6. The method according to claim 5, characterized in that obtaining the hotspot information in the text to be processed according to the hotspot analysis result comprises:
selecting, as the hotspot information in the text to be processed, the phrases represented by nodes whose importance score is greater than a set threshold, or by the nodes connected by edges whose importance score is greater than the set threshold; or
selecting, in descending order of importance score, a set number of nodes, or the nodes connected by a set number of edges, and taking the phrases they represent as the hotspot information in the text to be processed.
7. A hotspot information discovery system, characterized by comprising:
a text acquisition module, configured to obtain a text to be processed;
a preprocessing module, configured to perform word segmentation and part-of-speech tagging on the text to be processed;
a syntactic analysis module, configured to perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
a sorting module, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed, to obtain dependency syntax trees to be analyzed;
a network construction module, configured to construct a small-world network from the dependency syntax trees to be analyzed;
a hotspot analysis module, configured to perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network, including separately calculating the dependency frequency of each element in the dependency syntax trees to be analyzed and the network-related features of each element in the small-world network, and performing hotspot analysis according to the dependency frequency and the network-related features;
a hotspot information acquisition module, configured to obtain the hotspot information in the text to be processed according to the hotspot analysis result.
8. The system according to claim 7, characterized in that the preprocessing module performs word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.
9. The system according to claim 7, characterized in that the syntactic analysis module performs dependency parsing on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.
10. The system according to claim 7, characterized in that
the sorting module is specifically configured to: for the dependency syntax tree of each sentence in the text to be processed, remove the stop words according to the same principle, and connect the nodes that remain after the stop words are removed; transfer all the dependency relations represented by each edge before the stop words were removed to the newly generated edge, and set the corresponding dependency-relation importance to the average of the importance of all dependency relations on the newly generated edge.
11. The system according to any one of claims 7 to 10, characterized in that the hotspot analysis module comprises: a dependency frequency calculation module, a feature calculation module, and an importance score calculation module; the feature calculation module comprises a dependency degree calculation module and/or a betweenness centrality calculation module;
the dependency frequency calculation module is configured to calculate, according to the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed, where the dependency frequency of a node is the sum of the importance of all nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency-relation importance of all edges identical to the current edge that appear in all the dependency syntax trees to be analyzed of the text to be processed, and identical edges are edges that connect the same nodes;
the dependency degree calculation module is configured to calculate, according to the small-world network, the dependency degree of each node and each edge in the small-world network, where the dependency degree of a node is the sum of the dependency-relation importance of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects;
the betweenness centrality calculation module is configured to calculate, according to the small-world network, the betweenness centrality of each node and each edge in the small-world network, where the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
the importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, where the network-related features include the dependency degree and/or the betweenness centrality.
12. The system according to claim 11, characterized in that
the hotspot information acquisition module is specifically configured to select, as the hotspot information in the text to be processed, the phrases represented by nodes whose importance score is greater than a set threshold, or by the nodes connected by edges whose importance score is greater than the set threshold; or to select, in descending order of importance score, a set number of nodes, or the nodes connected by a set number of edges, and take the phrases they represent as the hotspot information in the text to be processed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510137773.9A CN106156041B (en) | 2015-03-26 | 2015-03-26 | Hot information finds method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106156041A CN106156041A (en) | 2016-11-23 |
CN106156041B true CN106156041B (en) | 2019-05-28 |
Family
ID=57339541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510137773.9A Active CN106156041B (en) | 2015-03-26 | 2015-03-26 | Hot information finds method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156041B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108133014B (en) * | 2017-12-22 | 2022-03-22 | 广州数说故事信息科技有限公司 | Triple generation method and device based on syntactic analysis and clustering and user terminal |
CN110852095B (en) * | 2018-08-02 | 2023-09-19 | 中国银联股份有限公司 | Statement hot spot extraction method and system |
CN109062902B (en) * | 2018-08-17 | 2022-12-06 | 科大讯飞股份有限公司 | Text semantic expression method and device |
CN110069624B (en) * | 2019-04-28 | 2021-05-04 | 北京小米智能科技有限公司 | Text processing method and device |
CN111209746B (en) * | 2019-12-30 | 2024-01-30 | 航天信息股份有限公司 | Natural language processing method and device, storage medium and electronic equipment |
CN110874531B (en) * | 2020-01-20 | 2020-07-10 | 湖南蚁坊软件股份有限公司 | Topic analysis method and device and storage medium |
CN111339751A (en) * | 2020-05-15 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Text keyword processing method, device and equipment |
CN113434751B (en) * | 2021-07-14 | 2023-06-02 | 国际关系学院 | Network hotspot artificial intelligent early warning system and method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101661513A (en) * | 2009-10-21 | 2010-03-03 | 上海交通大学 | Detection method of network focus and public sentiment |
CN102567405A (en) * | 2010-12-31 | 2012-07-11 | 北京安码科技有限公司 | Hotspot discovery method based on improved text space vector representation |
US20130204882A1 (en) * | 2012-02-07 | 2013-08-08 | Social Market Analytics, Inc. | Systems And Methods of Detecting, Measuring, And Extracting Signatures of Signals Embedded in Social Media Data Streams |
CN103473223A (en) * | 2013-09-25 | 2013-12-25 | 中国科学院计算技术研究所 | Rule extraction and translation method based on syntax tree |
CN103544255A (en) * | 2013-10-15 | 2014-01-29 | 常州大学 | Text semantic relativity based network public opinion information analysis method |
CN104516874A (en) * | 2014-12-29 | 2015-04-15 | 北京牡丹电子集团有限责任公司数字电视技术中心 | Method and system for parsing dependency of noun phrases |
CN105095288A (en) * | 2014-05-14 | 2015-11-25 | 腾讯科技(深圳)有限公司 | Data analysis method and data analysis device |
Also Published As
Publication number | Publication date |
---|---|
CN106156041A (en) | 2016-11-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||