CN106156041A - Hot information discovery method and system - Google Patents

Publication number: CN106156041A
Authority: CN (China)
Application number: CN201510137773.9A
Other languages: Chinese (zh)
Other versions: CN106156041B (en)
Legal status: Granted, active (as listed by Google Patents; an assumption, not a legal conclusion)
Inventors: 吴及, 侯晋峰, 胡国平, 吕萍, 王影, 胡郁, 刘庆峰
Current assignee: Tsinghua University; iFlytek Co Ltd (as listed by Google Patents; may be inaccurate)
Original assignee: Tsinghua University; iFlytek Co Ltd
Application filed by: Tsinghua University, iFlytek Co Ltd
Priority: CN201510137773.9A
Prior art keywords: node, dependency, syntax tree, edge, text to be processed
Abstract

The invention discloses a hot information discovery method and system. The method includes: obtaining a text to be processed; performing word segmentation and part-of-speech tagging on the text; performing syntactic analysis on the segmented text to obtain a dependency syntax tree for each sentence of the text; removing the stop words from the dependency syntax tree of each sentence to obtain the dependency syntax trees to be analyzed; building a small-world network from the dependency syntax trees to be analyzed; performing hotspot analysis based on the dependency syntax trees to be analyzed and the small-world network; and obtaining the hot information in the text from the hotspot analysis result. With the invention, the hot information in a text to be processed can be found efficiently and accurately.

Description

Hot information discovery method and system
Technical field
The present invention relates to the field of data mining, and in particular to a hot information discovery method and system.
Background
With the rapid development of the Internet and continuous progress in storage technology, we are surrounded by more and more text. Much of this text is redundant, and reading it item by item wastes a great deal of the user's time and effort. Hotspot analysis methods quickly extract key words or sentences, that is, hot information, from large volumes of text, so that users can learn the important information in a text conveniently and efficiently. Hotspot analysis has therefore become an active research topic, and its central problem is how to analyze a text efficiently and accurately in order to find the hot information it contains.
Existing hotspot analysis methods generally build a small-world network from word co-occurrence, compute the importance of each node in the network, and determine the hot information of the text from that importance. The importance is determined by the change in the average shortest path length of the network. When building the network, existing methods ignore the semantic relations between words: the network is built only from the distance between adjacent words. If two words are far apart in the text but closely related semantically, existing methods cannot find the relation. In addition, existing methods measure the importance of each node by shortest paths alone, which is a rather limited feature, so the words they rank as most important do not necessarily represent the semantics of the original text. Finally, every time the importance of a node is computed, all shortest paths in the network must be recomputed, which is inefficient.
Summary of the invention
Embodiments of the present invention provide a hot information discovery method and system for finding the hot information in a text to be processed efficiently and accurately.
To this end, the embodiments of the present invention provide the following technical solution:
A hot information discovery method, including:
Obtaining a text to be processed;
Performing word segmentation and part-of-speech tagging on the text to be processed;
Performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
Removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
Building a small-world network from the dependency syntax trees to be analyzed;
Performing hotspot analysis based on the dependency syntax trees to be analyzed and the small-world network;
Obtaining the hot information in the text to be processed from the hotspot analysis result.
Preferably, performing word segmentation and part-of-speech tagging on the text to be processed includes:
Using a method based on conditional random fields to perform word segmentation and part-of-speech tagging on the text to be processed.
Preferably, performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed includes:
Using a maximum spanning tree algorithm or a neural-network-based method to perform dependency parsing on the segmented text, obtaining the dependency syntax tree of each sentence in the text to be processed.
Preferably, removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed includes:
For the dependency syntax tree of each sentence in the text to be processed, removing the stop words according to the same rule and connecting the nodes that remain after removal;
Transferring all dependency relations represented by the edges that existed before stop word removal to the newly generated edge, and setting the dependency importance of the new edge to the mean of the dependency importances of all relations on it.
Preferably, performing hotspot analysis based on the dependency syntax trees to be analyzed and the small-world network includes:
Computing, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees. The dependency frequency of a node is the sum of the importances of all nodes identical to that node across all dependency syntax trees to be analyzed of the text. The dependency frequency of an edge is the sum of the dependency importances of all edges identical to the current edge across all dependency syntax trees to be analyzed of the text, where two edges are identical if they connect the same nodes;
Computing, from the small-world network, the network-related features of each node and each edge in the network. The network-related features include the dependency degree and/or the betweenness centrality. The dependency degree of a node is the sum of the dependency importances of the edges connected to that node in the small-world network. The dependency degree of an edge is the sum of the dependency degrees of the two nodes the edge connects. The betweenness centrality of a node or edge is the number of times it appears on the shortest path between any two other nodes in the small-world network;
Computing the importance score of each node and/or edge in the small-world network from the dependency frequency and the network-related features.
Preferably, obtaining the hot information in the text to be processed from the hotspot analysis result includes:
Selecting the phrases represented by nodes, or the phrase connections represented by edges, whose importance score exceeds a set threshold as the hot information in the text to be processed; or
Selecting, in descending order of importance score, a set number of phrases represented by nodes or phrase connections represented by edges as the hot information in the text to be processed.
A hot information discovery system, including:
A text acquisition module, for obtaining a text to be processed;
A preprocessing module, for performing word segmentation and part-of-speech tagging on the text to be processed;
A syntactic analysis module, for performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
A sorting module, for removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
A network construction module, for building a small-world network from the dependency syntax trees to be analyzed;
A hotspot analysis module, for performing hotspot analysis based on the dependency syntax trees to be analyzed and the small-world network;
A hot information acquisition module, for obtaining the hot information in the text to be processed from the hotspot analysis result.
Preferably, the preprocessing module uses a method based on conditional random fields to perform word segmentation and part-of-speech tagging on the text to be processed.
Preferably, the syntactic analysis module uses a maximum spanning tree algorithm or a neural-network-based method to perform dependency parsing on the segmented text, obtaining the dependency syntax tree of each sentence in the text to be processed.
Preferably, the sorting module is specifically configured to remove, for the dependency syntax tree of each sentence in the text to be processed, the stop words according to the same rule and connect the nodes that remain after removal; and to transfer all dependency relations represented by the edges that existed before stop word removal to the newly generated edge, setting the dependency importance of the new edge to the mean of the dependency importances of all relations on it.
Preferably, the hotspot analysis module includes a dependency frequency computing module, a feature computing module, and an importance score computing module. The feature computing module includes a dependency degree computing module and/or a betweenness centrality computing module;
The dependency frequency computing module is configured to compute, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees. The dependency frequency of a node is the sum of the importances of all nodes identical to that node across all dependency syntax trees to be analyzed of the text. The dependency frequency of an edge is the sum of the dependency importances of all edges identical to the current edge across all dependency syntax trees to be analyzed of the text, where two edges are identical if they connect the same nodes;
The dependency degree computing module is configured to compute, from the small-world network, the dependency degree of each node and each edge in the network. The dependency degree of a node is the sum of the dependency importances of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes the edge connects;
The betweenness centrality computing module is configured to compute, from the small-world network, the betweenness centrality of each node and each edge in the network. The betweenness centrality of a node or edge is the number of times it appears on the shortest path between any two other nodes in the small-world network;
The importance score computing module is configured to compute the importance score of each node and/or edge in the small-world network from the dependency frequency and the network-related features, where the network-related features include the dependency degree and/or the betweenness centrality.
Preferably, the hot information acquisition module is specifically configured to select the phrases represented by nodes, or the phrase connections represented by edges, whose importance score exceeds a set threshold as the hot information in the text to be processed; or to select, in descending order of importance score, a set number of phrases represented by nodes or phrase connections represented by edges as the hot information in the text to be processed.
The hot information discovery method and system provided by the embodiments of the present invention build the small-world network from dependency parsing, which better preserves the semantic information of the text. After the network is built, the network-related features are computed and ranked, hotspot analysis is performed on the ranked result, and the hot words and related information in the text to be processed are obtained from the hotspot analysis result. The hot information of the text can thus be analyzed efficiently and accurately, which speeds up the user's reading of the text and saves reading time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flow chart of the hot information discovery method of an embodiment of the present invention;
Fig. 2 is dependency syntax tree example one before stop word removal in an embodiment of the present invention;
Fig. 3 is dependency syntax tree example one after stop word removal in an embodiment of the present invention;
Fig. 4 is dependency syntax tree example two in an embodiment of the present invention;
Fig. 5 is dependency syntax tree example three in an embodiment of the present invention;
Fig. 6 is an example of part of the small-world network built in an embodiment of the present invention;
Fig. 7 is a structural diagram of the hot information discovery system of an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.
Fig. 1 is a flow chart of the hot information discovery method of an embodiment of the present invention, which comprises the following steps:
Step 101: obtain a text to be processed.
Step 102: perform word segmentation and part-of-speech tagging on the text to be processed.
For example, a method based on conditional random fields can be used to perform word segmentation and part-of-speech tagging on the text to be processed. Other methods can of course also be used: word segmentation can be done by longest-word matching, and part-of-speech tagging by a method based on an HMM (Hidden Markov Model).
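The longest-word matching alternative mentioned above can be illustrated with a forward maximum matching sketch. This is only one common way to do longest-word matching, not necessarily what the embodiment uses; the function name `forward_max_match`, the `max_len` limit, and the tiny vocabulary are assumptions made for the illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation (hypothetical helper, not from the patent)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Assumed toy vocabulary: "small-world", "network", "structure".
vocab = {"小世界", "网络", "结构"}
tokens = forward_max_match("小世界网络结构", vocab)
```

A production system would normally use a trained segmenter and tagger; the sketch only shows the matching rule.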
Step 103: perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed.
For example, a maximum spanning tree algorithm or a neural-network-based method can be used to perform dependency parsing on the segmented text, obtaining the dependency syntax tree of each sentence in the text to be processed.
For example, one sentence in the text to be processed is "a small-world network is the structure of a class of special complex networks", and its dependency syntax tree is shown in Fig. 2. The letter abbreviations on the edges are dependency relations, and each kind of dependency relation is assigned its own importance, as shown in Table 1 below.
Table 1:
Dependency relation                      Importance
Attributive relation (ATT)               1.0
Subject-predicate relation (SBV)         1.0
Verb-object relation (VOB)               1.0
Quantity relation (QUN)                  0.9
"的" (de) particle structure (DE)         0.5
Step 104: remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed.
Stop words are words that carry no meaning in the text to be processed, such as "this", "is", and "uh".
Stop words are removed from all dependency syntax trees of the text according to the same rule. For example, the nodes that remain after removal can be connected under the rule that the right node depends on the left node, or alternatively under the rule that the left node depends on the right node. In addition, all dependency relations represented by the edges that existed before removal can be transferred to the newly generated edge, whose dependency importance is then the mean of the dependency importances of all relations on it. A single representative dependency importance can also be chosen as the dependency importance of the new edge; the embodiments of the present invention do not limit this. Fig. 3 shows the dependency syntax tree after stop word removal for the first sentence of the text, "a small-world network is the structure of a class of special complex networks". Before removal, the "network" node and the "structure" node are connected to the "is" node by two dependency relations, SBV and VOB (see Fig. 2). After removal, both relations are transferred to the newly generated edge, and the dependency importance of the new edge is the mean of the two dependency importances.
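The edge merging in step 104 can be sketched as follows. This is a simplified illustration under stated assumptions: edges are treated as undirected triples, the neighbours of the removed word are reattached to its leftmost neighbour (one reading of the "right node depends on left node" rule), and the names `remove_stop_word`, `edge_importance`, and the three-edge fragment of the example sentence are hypothetical. The importance values come from Table 1.

```python
from statistics import mean

# Dependency importances from Table 1.
IMPORTANCE = {"ATT": 1.0, "SBV": 1.0, "VOB": 1.0, "QUN": 0.9, "DE": 0.5}


def remove_stop_word(edges, stop, order):
    """edges: list of (node_a, node_b, [relations]); order: word -> sentence position."""
    kept, touching = [], []
    for a, b, rels in edges:
        if stop in (a, b):
            other = b if a == stop else a
            touching.append((other, rels))
        else:
            kept.append((a, b, rels))
    # Reattach the remaining neighbours to the leftmost one (assumed rule).
    touching.sort(key=lambda item: order[item[0]])
    if touching:
        anchor, anchor_rels = touching[0]
        for other, rels in touching[1:]:
            # The new edge carries every relation of the two edges it replaces.
            kept.append((anchor, other, anchor_rels + rels))
    return kept


def edge_importance(rels):
    """Dependency importance of an edge: mean over all relations it carries."""
    return mean(IMPORTANCE[r] for r in rels)


# Simplified fragment of the example sentence (not the full tree of Fig. 2).
order = {"small-world": 0, "network": 1, "is": 2, "structure": 3}
edges = [
    ("network", "small-world", ["ATT"]),
    ("is", "network", ["SBV"]),
    ("is", "structure", ["VOB"]),
]
cleaned = remove_stop_word(edges, "is", order)
merged = cleaned[-1]
```

Here the merged edge between "network" and "structure" carries both SBV and VOB, and its importance is the mean of the two Table 1 values.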
Step 105: build a small-world network from the dependency syntax trees to be analyzed.
The small-world network is built from the dependency syntax tree of each sentence after stop word removal, as follows:
1) Initialize an empty network G=(V, E), where V is the set of nodes and E is the set of edges;
2) Obtain, one at a time, the dependency syntax tree of each sentence of the text after stop word removal;
3) Traverse each dependency syntax tree from its root node, depth-first or breadth-first;
4) On reaching a node, check whether it already exists in V. If it does, move on to the next node; if not, add it to V;
5) On reaching an edge, check whether it already exists in E. If it does, move on to the next edge; if not, add it to E;
6) Check whether all dependency syntax trees of the text have been traversed. If so, go to step 7); otherwise go back to step 2);
7) Once all dependency syntax trees of the text have been traversed, the small-world network G=(V, E) is obtained.
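Steps 1) to 7) amount to taking the union of the nodes and edges of all trees. A minimal sketch, assuming each tree is given as a root plus a mapping from a node to its dependents, that a word appears at most once per tree, and that edges are undirected; the function name `build_network` and the two toy trees are hypothetical and do not reproduce Figs. 3 to 5.

```python
def build_network(trees):
    """trees: list of (root, children) pairs; children maps a node to its dependents."""
    V, E = set(), set()                          # step 1: empty network G = (V, E)
    for root, children in trees:                 # steps 2 and 6: visit every tree
        stack = [root]                           # step 3: depth-first from the root
        while stack:
            node = stack.pop()
            V.add(node)                          # step 4: a set ignores known nodes
            for child in children.get(node, []):
                E.add(frozenset((node, child)))  # step 5: a set ignores known edges
                stack.append(child)
    return V, E                                  # step 7


# Hypothetical toy trees.
tree_1 = ("network", {"network": ["small-world", "structure"]})
tree_2 = ("node", {"node": ["most", "network"]})
V, E = build_network([tree_1, tree_2])
```

Because V and E are sets, the "already exists" checks of steps 4) and 5) are implicit, and the traversal order does not affect the result.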
Fig. 4 shows the dependency syntax tree after stop word removal for the second sentence of the text, "most nodes are not connected to each other", and Fig. 5 shows the dependency syntax tree after stop word removal for the third sentence, "but most nodes can be reached from one another in a few steps".
The small-world network built from all dependency syntax trees of the text after stop word removal is shown in Fig. 6, which depicts part of the small-world network corresponding to the text.
Step 106: perform hotspot analysis based on the dependency syntax trees to be analyzed and the small-world network.
Specifically, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed can be computed from those trees, and the network-related features of each node and each edge in the small-world network can be computed from the network. The network-related features include the dependency degree and/or the betweenness centrality. The importance score of each node and/or edge in the small-world network is then computed from the dependency frequency and the network-related features.
The concepts of dependency frequency, dependency degree, and betweenness centrality, and how they are computed, are described in detail below.
1) Compute the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed.
The more phrases depend on the current word, the higher the dependency frequency of that word. Phrases are represented by nodes in the dependency syntax tree.
The dependency frequency of a node is the sum of the importances of all nodes identical to the current node across all dependency syntax trees to be analyzed of the text. The importance of a node in one tree is the square root of the number of nodes that depend on it directly or indirectly. In Fig. 3, 2 nodes depend directly on the "network" node and 4 depend on it indirectly, 6 in total, so the importance of the "network" node in the tree of Fig. 3 is the square root of 6, about 2.45. Likewise, the importance of the "network" node in the tree of Fig. 4 is 1. If the word "network" occurs only these two times in the text, the dependency frequency of the "network" node is 2.45+1=3.45. The calculation is given by formula (1).
$$NDDeg_i = \sum_{j=1}^{V_i} \sqrt{Npro_j} \tag{1}$$
where $NDDeg_i$ is the dependency frequency of the i-th node, $V_i$ is the number of nodes identical to the i-th node, and $Npro_j$ is the number of nodes that depend directly or indirectly on the j-th such node.
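The worked example for formula (1) can be checked with a short sketch. It assumes each tree is a mapping from a node to its direct dependents and that a node's per-tree importance is the square root of its number of direct and indirect dependents, as described above. The function names and the placeholder node names "a" to "g" are hypothetical; only the counts (2 direct plus 4 indirect dependents, then 1 dependent) follow the example.

```python
import math
from collections import defaultdict


def count_dependents(children, node):
    """Number of nodes depending on `node` directly or indirectly."""
    total = 0
    for child in children.get(node, []):
        total += 1 + count_dependents(children, child)
    return total


def node_dependency_frequency(trees):
    """Sum, over all trees, of sqrt(number of dependents) for each word."""
    freq = defaultdict(float)
    for children in trees:
        nodes = set(children) | {c for kids in children.values() for c in kids}
        for node in nodes:
            freq[node] += math.sqrt(count_dependents(children, node))
    return dict(freq)


# Placeholder trees: "network" has 6 dependents in the first, 1 in the second.
tree_a = {"network": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"]}
tree_b = {"network": ["g"]}
freq = node_dependency_frequency([tree_a, tree_b])
network_freq = round(freq["network"], 2)  # sqrt(6) + sqrt(1)
```

Under this definition a leaf node contributes 0 in the tree where it has no dependents.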
The dependency frequency of an edge is the sum of the dependency importances of all edges identical to the current edge across all dependency syntax trees to be analyzed of the text, where two edges are identical if they connect the same nodes. For example, the edge between "small-world" and "network" in Fig. 3 has the dependency relation ATT, with importance 1.0. If an edge between "small-world" and "network" appears elsewhere in the network with the dependency relation LAD and importance 0.6, then the dependency frequency of this edge is 1.6. The calculation is given by formula (2):
$$EDDeg_k = \sum_{e=1}^{E_k} IDeg_e \tag{2}$$
where $EDDeg_k$ is the dependency frequency of the k-th edge, $E_k$ is the number of edges identical to the k-th edge, and $IDeg_e$ is the dependency importance of the e-th such edge.
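Formula (2) is a grouped sum over edge occurrences. A minimal sketch, assuming each occurrence is given as a (node, node, importance) triple and that edges are undirected; the function name and the third occurrence in the sample data are hypothetical, while the 1.0 and 0.6 values follow the example above.

```python
from collections import defaultdict


def edge_dependency_frequency(edge_occurrences):
    """Sum the dependency importance of every occurrence of the same edge."""
    freq = defaultdict(float)
    for a, b, importance in edge_occurrences:
        freq[frozenset((a, b))] += importance  # same endpoints = same edge
    return dict(freq)


occurrences = [
    ("small-world", "network", 1.0),   # ATT occurrence from the example
    ("network", "small-world", 0.6),   # LAD occurrence from the example
    ("network", "structure", 1.0),     # hypothetical extra edge
]
edge_freq = edge_dependency_frequency(occurrences)
```

The value stored for the edge between "small-world" and "network" is the 1.6 of the worked example, up to floating-point rounding.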
2) Compute the dependency degree of each node and each edge in the small-world network.
The dependency degree of each node and each edge in the network is computed from the importance of each kind of dependency relation.
The dependency degree of a node is the sum of the dependency importances of the edges connected to that node in the network. In Fig. 3, the "network" node is connected to 2 edges. The first edge has the dependency relation ATT, with dependency importance 1.0. The second edge has the dependency relation SBV-VOB, whose dependency importance is the mean of the SBV and VOB importances, that is, 1.0. The dependency degree of the "network" node is therefore 2.0, as given by formula (3).
$$NIDeg_i = \sum_{k=1}^{N_i} IDeg_k \tag{3}$$
where $NIDeg_i$ is the dependency degree of the i-th node, $N_i$ is the number of edges connected to the i-th node, and $IDeg_k$ is the dependency importance of the k-th such edge.
The dependency degree of an edge is the sum of the dependency degrees of the two nodes the edge connects. In Fig. 3, the dependency degree of the edge between "small-world" and "network" is the sum of the dependency degrees of the "small-world" node and the "network" node, as given by formula (4):
$$EIDeg_k = NIDeg_{i1} + NIDeg_{i2} \tag{4}$$
where $EIDeg_k$ is the dependency degree of the k-th edge, and $NIDeg_{i1}$ and $NIDeg_{i2}$ are the dependency degrees of the two nodes $i1$ and $i2$ connected by the k-th edge.
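Formulas (3) and (4) can be sketched together. The sketch assumes the network is given as a mapping from an undirected edge to its dependency importance; the function name `dependency_degrees` and the two-edge fragment are hypothetical, so the values for "small-world" and "structure" reflect only this fragment, not the full network of Fig. 3.

```python
from collections import defaultdict


def dependency_degrees(edges):
    """edges: dict frozenset({a, b}) -> dependency importance of that edge."""
    node_deg = defaultdict(float)
    for pair, importance in edges.items():
        for node in pair:
            node_deg[node] += importance          # formula (3)
    # Formula (4): an edge's degree is the sum of its two endpoint degrees.
    edge_deg = {pair: sum(node_deg[n] for n in pair) for pair in edges}
    return dict(node_deg), edge_deg


edges = {
    frozenset(("small-world", "network")): 1.0,   # ATT
    frozenset(("network", "structure")): 1.0,     # merged SBV-VOB edge
}
node_deg, edge_deg = dependency_degrees(edges)
```

For the "network" node this reproduces the dependency degree of 2.0 from the text.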
3) Compute the betweenness centrality of each node or edge in the small-world network.
The betweenness centrality of a node or edge is the number of times it appears on the shortest path between any two other nodes in the network. In Fig. 3, for example, the shortest path between the "small-world" node and the "structure" node is "small-world", "network", "structure", with length 2. The "network" node lies on this shortest path, so its betweenness centrality is 1; if it also lies on the shortest path between another pair of nodes, its betweenness centrality is 2. The edge between "small-world" and "network" also lies on this shortest path; if it does not lie on the shortest path between any other pair of nodes, its betweenness centrality is 1. When computing shortest paths, the distance between adjacent nodes can be measured in the traditional way, as 1, or as the reciprocal of the dependency frequency of the edge between the two nodes. For example, if the dependency frequency of the edge between "small-world" and "network" in Fig. 3 is 1.6, the distance between the "small-world" node and the "network" node is the reciprocal of 1.6, that is, 0.625.
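The counting described above can be sketched with Dijkstra's algorithm. Several points are assumptions of this sketch rather than statements of the source: one shortest path is used per node pair (ties are not enumerated), direct neighbours contribute nothing so that the edge count matches the worked example, and the adjacency map uses reciprocal dependency frequencies as distances (0.625 from the example, 1.0 chosen for illustration).

```python
import heapq
from itertools import combinations


def shortest_path(adj, src, dst):
    """Dijkstra; adj maps node -> {neighbour: distance}. Returns a node list or None."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in adj[node].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def betweenness(adj):
    """Count how often each node and edge lies on a shortest path between two others."""
    node_count = {n: 0 for n in adj}
    edge_count = {}
    for src, dst in combinations(sorted(adj), 2):
        path = shortest_path(adj, src, dst)
        if path is None or len(path) < 3:
            continue  # unreachable pair, or direct neighbours (assumed not counted)
        for node in path[1:-1]:
            node_count[node] += 1
        for a, b in zip(path, path[1:]):
            key = frozenset((a, b))
            edge_count[key] = edge_count.get(key, 0) + 1
    return node_count, edge_count


adj = {
    "small-world": {"network": 0.625},
    "network": {"small-world": 0.625, "structure": 1.0},
    "structure": {"network": 1.0},
}
node_bc, edge_bc = betweenness(adj)
```

On this three-node fragment the "network" node and the edge between "small-world" and "network" each receive a count of 1, as in the example.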
Once the dependency frequency and the dependency degree and/or betweenness centrality have been computed, these features can be combined to determine the importance score of each node and/or edge in the small-world network. In practice, all three features can be used together, or the dependency frequency can be combined with the dependency degree only, or with the betweenness centrality only; the embodiments of the present invention do not limit this.
The following uses all three features together to compute the importance score of each node and/or edge in the small-world network.
The three features are treated as a three-dimensional feature vector for each word node of the text to be processed. Because each feature dimension has a different value range, the raw values cannot be combined directly, so the values in each dimension are normalized first. Normalization can be done by rank scoring or by another method, for example dividing each feature value by the sum of the feature values in its dimension.
With rank scoring, the values in each dimension are sorted in ascending order, and the rank index of a value after sorting is used as its score. For example, if the "network" node has dependency frequency 2.45, dependency degree 2.0, and betweenness centrality 2, and the corresponding rank indices are 3, 6, and 10, then the three feature scores of the node are 3, 6, and 10.
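Rank scoring for a single feature dimension can be sketched as below. The 1-based ranks and the tie-breaking by input order are assumptions of the sketch, and the sample values for "node" and "structure" are hypothetical.

```python
def rank_scores(values):
    """values: dict item -> raw feature value; returns item -> ascending 1-based rank."""
    ordered = sorted(values, key=lambda item: values[item])
    return {item: rank for rank, item in enumerate(ordered, start=1)}


# Hypothetical dependency frequencies for three nodes.
dependency_frequency = {"network": 2.45, "node": 1.0, "structure": 1.73}
ranks = rank_scores(dependency_frequency)
```

Applying the same function to each feature dimension gives scores on a common scale, which is what allows them to be added afterwards.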
From the normalized three-dimensional feature scores, the importance score of each node and/or edge in the network can be computed as given by formula (5):
$$FScore_i = \sum_{j=1}^{R} Score_{ij} \tag{5}$$
where $FScore_i$ is the importance score of the i-th node or edge, $Score_{ij}$ is the score of the j-th feature dimension of the i-th node or edge, and $R$ is the number of feature dimensions of each node or edge, for example 3.
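Formula (5) is a plain sum over the R normalized feature scores. A minimal sketch, assuming the scores are already normalized; the scores 3, 6, and 10 for "network" come from the rank scoring example above, while the function name and the "structure" entry are hypothetical.

```python
def importance_scores(feature_scores):
    """feature_scores: dict item -> list of R normalized feature scores."""
    return {item: sum(scores) for item, scores in feature_scores.items()}


scores = importance_scores({
    "network": [3, 6, 10],    # rank scores from the worked example
    "structure": [2, 4, 7],   # hypothetical values
})
```

With these inputs the "network" node scores 3 + 6 + 10 = 19.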
Step 107: obtain the hot information in the text to be processed from the hotspot analysis result.
Specifically, the phrases represented by nodes, or the phrase connections represented by edges, whose importance score exceeds a set threshold can be selected as the hot information in the text to be processed. Alternatively, a set number (for example 10) of phrases represented by nodes or phrase connections represented by edges can be selected in descending order of importance score. In Fig. 6, the three groups of hot information obtained are: network-node, network-structure, and node-most.
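The two selection strategies of step 107 can be sketched in one helper. The function name, its parameters, and the numeric scores are hypothetical; only the three edge labels follow the Fig. 6 example.

```python
def select_hot_items(scores, threshold=None, top_n=None):
    """Pick items by score: above a threshold, the top N, or both if both are given."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(item, s) for item, s in ranked if s > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [item for item, _ in ranked]


# Hypothetical importance scores for four edges.
edge_scores = {
    "network-node": 27,
    "network-structure": 25,
    "node-most": 22,
    "steps-few": 9,
}
by_count = select_hot_items(edge_scores, top_n=3)
by_threshold = select_hot_items(edge_scores, threshold=20)
```

With these made-up scores both strategies return the same three edges; in general the threshold variant returns a variable number of items.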
The hot information of the embodiment of the present invention finds method, carries out the structure of small-world network according to interdependent syntactic analysis, can preferably this semantic information of stet.After described network struction completes, calculate network correlated characteristic and sort, analysis of central issue is carried out according to the result after sequence, the focus vocabulary relevant information in pending text is obtained according to analysis of central issue result, such that it is able to analyze the hot information of pending text efficiently and accurately, and then effectively promote the speed that user version is read, save reading time.
Correspondingly, an embodiment of the present invention also provides a hot information discovery system. Fig. 7 is a schematic structural diagram of this system.
In this embodiment, the system includes:
Text acquisition module 701, configured to obtain the text to be processed;
Preprocessing module 702, configured to perform word segmentation and part-of-speech tagging on the text to be processed;
Syntactic analysis module 703, configured to perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
Sorting module 704, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
Network construction module 705, configured to build a small-world network from the dependency syntax trees to be analyzed;
Hotspot analysis module 706, configured to perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
Hot information acquisition module 707, configured to obtain the hot information in the text to be processed according to the hotspot analysis result.
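The data flow between modules 701 to 707 can be sketched as a simple pipeline. This is a structural illustration only: the class name `HotInfoPipeline`, its field names, and the use of plain callables for each module are assumptions, and no concrete segmenter, parser, or network builder is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class HotInfoPipeline:
    """Chains seven stages in the order of modules 701 to 707."""
    acquire: Callable[[], str]           # text acquisition module 701
    preprocess: Callable[[str], Any]     # preprocessing module 702
    parse: Callable[[Any], Any]          # syntactic analysis module 703
    clean: Callable[[Any], Any]          # stop-word removal, module 704
    build_network: Callable[[Any], Any]  # network construction module 705
    analyze: Callable[[Any, Any], Any]   # hotspot analysis module 706
    select: Callable[[Any], Any]         # hot information acquisition module 707

    def run(self):
        text = self.acquire()
        tagged = self.preprocess(text)
        trees = self.parse(tagged)
        cleaned = self.clean(trees)
        network = self.build_network(cleaned)
        # Hotspot analysis uses both the cleaned trees and the network.
        scores = self.analyze(cleaned, network)
        return self.select(scores)
```

Passing each stage in as a callable mirrors the statement that the modules may be realized with different methods without changing the overall structure.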
The preprocessing module 702 may use a method based on conditional random fields to perform word segmentation and part-of-speech tagging on the text to be processed. The syntactic analysis module 703 may use a maximum spanning tree algorithm or a neural-network-based method to perform dependency syntactic analysis on the segmented text and obtain the dependency syntax tree of each sentence in the text to be processed. Of course, these two modules may also use other methods for word segmentation, part-of-speech tagging, and syntactic analysis; the embodiment of the present invention places no limitation on this.
It should be noted that, for the dependency syntax tree of each sentence in the text to be processed, the sorting module 704 removes the stop words according to the same principle and reconnects the nodes that remain after the removal. For example, the remaining nodes may be connected according to the principle that the right node depends on the left node, or according to the principle that the left node depends on the right node. In addition, the dependency relations represented by the edges that existed before the stop words were removed are all transferred to the newly generated edge. The dependency relation importance of the newly generated edge may be set to the mean of the importance values of all dependency relations on that edge; alternatively, one representative dependency relation importance may be chosen as the importance of the newly generated edge.
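The stop-word removal and the averaging of dependency relation importance can be illustrated with a short Python sketch. The tree representation (a dictionary mapping each word to its head and the importance of that dependency), the function name `remove_stop_words`, and the rule of reattaching each remaining word to its nearest non-stop ancestor are assumptions made for this example; the text only requires that one consistent principle be applied to every sentence. The sketch also assumes that each word occurs once per sentence.

```python
from statistics import mean


def remove_stop_words(tree, stop_words):
    """Drop stop words and reconnect the remaining nodes.

    `tree` maps each word to (head, importance); the root has head None.
    The new edge carries the mean importance of the edges it replaces.
    """
    cleaned = {}
    for word, (head, weight) in tree.items():
        if word in stop_words:
            continue
        weights = [weight]
        # Skip over removed heads, remembering every collapsed edge.
        while head is not None and head in stop_words:
            head, weight = tree[head]
            weights.append(weight)
        cleaned[word] = (head, mean(weights) if head is not None else None)
    return cleaned


# Hypothetical tree: "network" -> "of" -> "structure" -> "builds" (root).
tree = {
    "builds": (None, None),
    "structure": ("builds", 0.5),
    "of": ("structure", 0.25),
    "network": ("of", 0.75),
}
cleaned = remove_stop_words(tree, stop_words={"of"})
```

Here the edges network-of (0.75) and of-structure (0.25) collapse into a single edge network-structure whose importance is their mean.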
In practical applications, the hotspot analysis module 706 may perform hotspot analysis by calculating the importance score of each node and/or edge in the small-world network. One specific structure of this module includes a dependency frequency calculation module, a feature calculation module, and an importance score calculation module; the feature calculation module includes a dependency degree calculation module and/or a betweenness centrality calculation module. Specifically:
The dependency frequency calculation module is configured to calculate, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees. The dependency frequency of a node is the sum of the importance of all nodes identical to that node across all dependency syntax trees to be analyzed of the text to be processed; the dependency frequency of an edge is the sum of the dependency relation importance of all edges identical to the current edge across all dependency syntax trees to be analyzed of the text to be processed, where two edges are identical if they connect the same nodes;
The dependency degree calculation module is configured to calculate, from the small-world network, the dependency degree of each node and each edge in the small-world network. The dependency degree of a node is the sum of the dependency relation importance of the edges connected to that node in the small-world network; the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects;
The betweenness centrality calculation module is configured to calculate, from the small-world network, the betweenness centrality of each node and each edge in the small-world network. The betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
The importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, where the network-related features include the dependency degree and/or the betweenness centrality.
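The two network-related features defined above can be computed as in the following Python sketch. The edge representation (a `frozenset` of two words mapped to a dependency relation importance), the function names, and the sample weights are assumptions for illustration. Shortest paths are measured by hop count here, which the text does not specify, and self-loops are assumed not to occur.

```python
from collections import defaultdict, deque
from itertools import combinations


def dependency_degree(edges):
    """Node degree: sum of incident edge weights. Edge degree: sum of both ends."""
    node_degree = defaultdict(float)
    for edge, weight in edges.items():
        for node in edge:
            node_degree[node] += weight
    edge_degree = {e: sum(node_degree[n] for n in e) for e in edges}
    return dict(node_degree), edge_degree


def shortest_paths(adjacency, source, target):
    """Return every shortest path (by hop count) from source to target."""
    dist = {source: 0}
    preds = defaultdict(list)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)
    if target not in dist:
        return []
    paths = [[target]]
    while paths[0][0] != source:  # all shortest paths have the same length
        paths = [[p] + path for path in paths for p in preds[path[0]]]
    return paths


def betweenness(edges):
    """Count how often each node and edge lies on a shortest path."""
    adjacency = defaultdict(set)
    for edge in edges:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)
    node_count, edge_count = defaultdict(int), defaultdict(int)
    for s, t in combinations(sorted(adjacency), 2):
        for path in shortest_paths(adjacency, s, t):
            for node in path[1:-1]:  # interior nodes only
                node_count[node] += 1
            for u, v in zip(path, path[1:]):
                edge_count[frozenset((u, v))] += 1
    return dict(node_count), dict(edge_count)


# Hypothetical network: structure - network - node - major part.
edges = {
    frozenset(("network", "node")): 1.0,
    frozenset(("network", "structure")): 0.5,
    frozenset(("node", "major part")): 0.5,
}
node_degree, edge_degree = dependency_degree(edges)
node_btw, edge_btw = betweenness(edges)
```

Enumerating all shortest paths per node pair is adequate for the small graphs of a single document; a larger network would call for a more efficient betweenness algorithm.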
Correspondingly, the hot information acquisition module 707 may join the words represented by the nodes or edges whose importance score exceeds a set threshold and select them as the hot information in the text to be processed; alternatively, it may rank the nodes or edges by importance score from high to low and select the joined words represented by a set number of them as the hot information in the text to be processed.
The hot information discovery system of the embodiment of the present invention builds the small-world network from dependency syntactic analysis, which better preserves the semantic information of the text. After the network is built, the network-related features are calculated and ranked, hotspot analysis is performed on the ranked results, and the information related to hot words in the text to be processed is obtained from the hotspot analysis result. The hot information of the text to be processed can thus be found efficiently and accurately, which speeds up the user's reading of the text and saves reading time.
It should be noted that the hot information discovery method and system of the embodiment of the present invention can be applied in fields such as natural language processing, information search, and information processing, and can efficiently and accurately obtain the information related to the hot words that play an important role in the text to be processed.
Each embodiment in this specification is described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant parts, see the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The embodiments of the present invention have been described in detail above. Specific implementations are used herein to illustrate the present invention, and the description of the above embodiments is intended only to help in understanding the method and system of the present invention. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A hot information discovery method, characterized by comprising:
obtaining a text to be processed;
performing word segmentation and part-of-speech tagging on the text to be processed;
performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
building a small-world network from the dependency syntax trees to be analyzed;
performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
obtaining the hot information in the text to be processed according to the hotspot analysis result.
2. The method according to claim 1, characterized in that performing word segmentation and part-of-speech tagging on the text to be processed comprises:
using a method based on conditional random fields to perform word segmentation and part-of-speech tagging on the text to be processed.
3. The method according to claim 1, characterized in that performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed comprises:
using a maximum spanning tree algorithm or a neural-network-based method to perform dependency syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed.
4. The method according to claim 1, characterized in that removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed comprises:
for the dependency syntax tree of each sentence in the text to be processed, removing the stop words according to the same principle and reconnecting the nodes that remain after the removal;
transferring all the dependency relations represented by the edges that existed before the stop words were removed to the newly generated edge, and setting the corresponding dependency relation importance to the mean of the importance values of all dependency relations on the newly generated edge.
5. The method according to any one of claims 1 to 4, characterized in that performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network comprises:
calculating, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, wherein the dependency frequency of a node is the sum of the importance of all nodes identical to that node across all dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importance of all edges identical to the current edge across all dependency syntax trees to be analyzed of the text to be processed, and two edges are identical if they connect the same nodes;
calculating, from the small-world network, the network-related features of each node and each edge in the small-world network, wherein the network-related features include dependency degree and/or betweenness centrality, the dependency degree of a node is the sum of the dependency relation importance of the edges connected to that node in the small-world network, the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects, and the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
calculating the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features.
6. The method according to claim 5, characterized in that obtaining the hot information in the text to be processed according to the hotspot analysis result comprises:
joining the words represented by the nodes or edges whose importance score exceeds a set threshold and selecting them as the hot information in the text to be processed; or
ranking the nodes or edges by importance score from high to low and selecting the joined words represented by a set number of them as the hot information in the text to be processed.
7. A hot information discovery system, characterized by comprising:
a text acquisition module, configured to obtain a text to be processed;
a preprocessing module, configured to perform word segmentation and part-of-speech tagging on the text to be processed;
a syntactic analysis module, configured to perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;
a sorting module, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax trees to be analyzed;
a network construction module, configured to build a small-world network from the dependency syntax trees to be analyzed;
a hotspot analysis module, configured to perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;
a hot information acquisition module, configured to obtain the hot information in the text to be processed according to the hotspot analysis result.
8. The system according to claim 7, characterized in that the preprocessing module uses a method based on conditional random fields to perform word segmentation and part-of-speech tagging on the text to be processed.
9. The system according to claim 7, characterized in that the syntactic analysis module uses a maximum spanning tree algorithm or a neural-network-based method to perform dependency syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed.
10. The system according to claim 7, characterized in that
the sorting module is specifically configured to: for the dependency syntax tree of each sentence in the text to be processed, remove the stop words according to the same principle and reconnect the nodes that remain after the removal; transfer all the dependency relations represented by the edges that existed before the stop words were removed to the newly generated edge; and set the corresponding dependency relation importance to the mean of the importance values of all dependency relations on the newly generated edge.
11. The system according to any one of claims 7 to 10, characterized in that the hotspot analysis module includes a dependency frequency calculation module, a feature calculation module, and an importance score calculation module; the feature calculation module includes a dependency degree calculation module and/or a betweenness centrality calculation module;
the dependency frequency calculation module is configured to calculate, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, wherein the dependency frequency of a node is the sum of the importance of all nodes identical to that node across all dependency syntax trees to be analyzed of the text to be processed, the dependency frequency of an edge is the sum of the dependency relation importance of all edges identical to the current edge across all dependency syntax trees to be analyzed of the text to be processed, and two edges are identical if they connect the same nodes;
the dependency degree calculation module is configured to calculate, from the small-world network, the dependency degree of each node and each edge in the small-world network, wherein the dependency degree of a node is the sum of the dependency relation importance of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects;
the betweenness centrality calculation module is configured to calculate, from the small-world network, the betweenness centrality of each node and each edge in the small-world network, wherein the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;
the importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, wherein the network-related features include the dependency degree and/or the betweenness centrality.
12. The system according to claim 11, characterized in that
the hot information acquisition module is specifically configured to join the words represented by the nodes or edges whose importance score exceeds a set threshold and select them as the hot information in the text to be processed; or to rank the nodes or edges by importance score from high to low and select the joined words represented by a set number of them as the hot information in the text to be processed.
CN201510137773.9A 2015-03-26 2015-03-26 Hot information finds method and system Active CN106156041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510137773.9A CN106156041B (en) 2015-03-26 2015-03-26 Hot information finds method and system

Publications (2)

Publication Number Publication Date
CN106156041A true CN106156041A (en) 2016-11-23
CN106156041B CN106156041B (en) 2019-05-28

Family

ID=57339541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510137773.9A Active CN106156041B (en) 2015-03-26 2015-03-26 Hot information finds method and system

Country Status (1)

Country Link
CN (1) CN106156041B (en)

Also Published As

Publication number Publication date
CN106156041B (en) 2019-05-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant