CN106156041B

CN106156041B - Hot information discovery method and system

Info

Publication number: CN106156041B
Application number: CN201510137773.9A
Authority: CN
Inventors: 吴及; 侯晋峰; 胡国平; 吕萍; 王影; 胡郁; 刘庆峰
Original assignee: Tsinghua University; iFlytek Co Ltd
Current assignee: Tsinghua University; iFlytek Co Ltd
Priority date: 2015-03-26
Filing date: 2015-03-26
Publication date: 2019-05-28
Anticipated expiration: 2035-03-26
Also published as: CN106156041A

Abstract

The invention discloses a hot information discovery method and system. The method comprises: obtaining a text to be processed; performing word segmentation and part-of-speech tagging on the text to be processed; performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed; removing the stop words from the dependency syntax tree of each sentence to obtain dependency syntax trees to be analyzed; constructing a small-world network from the dependency syntax trees to be analyzed; performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network; and obtaining the hot information in the text to be processed according to the hotspot analysis result. With the invention, the hot information in the text to be processed can be found efficiently and accurately.

Description

Hot information discovery method and system

Technical field

The present invention relates to the field of data mining technology, and in particular to a hot information discovery method and system.

Background art

With the rapid development of the Internet and the continuous progress of storage technology, we are surrounded by more and more text information. This information contains a great deal of redundancy, however, and reading it item by item would clearly waste a large amount of the user's time and energy. Hotspot analysis methods can quickly extract key words or sentences, i.e., hot information, from a large amount of text, so that the user can conveniently and quickly learn the important information contained in the text; hotspot analysis has therefore become a research focus. How to perform hotspot analysis on a text efficiently and accurately, and so find the corresponding hot information in the text to be processed, has become the primary task of hotspot analysis.

Existing hotspot analysis methods generally construct a small-world network based on word co-occurrence, calculate the importance of each node in the network, and determine the hot information of the text to be processed according to that importance. The importance is determined from the change in the average shortest path length of the network. When constructing the network, existing methods generally do not consider the semantic information between words; the network is built only according to the distance between adjacent words. If two words are far apart in the text but closely related semantically, existing methods cannot find this relation. In addition, when calculating the importance of each node, existing methods measure it using the shortest path alone, so the feature is rather limited, and the words rated most important do not necessarily represent the semantic information of the original text. Moreover, all shortest paths in the network must be recalculated each time the importance of a node is calculated, which is inefficient.

Summary of the invention

The embodiments of the present invention provide a hot information discovery method and system, so as to find the hot information in a text to be processed efficiently and accurately.

To this end, the embodiments of the present invention provide the following technical solutions:

A hot information discovery method, comprising:

obtaining a text to be processed;

performing word segmentation and part-of-speech tagging on the text to be processed;

performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;

removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;

constructing a small-world network from the dependency syntax trees to be analyzed;

performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;

obtaining the hot information in the text to be processed according to the hotspot analysis result.

Preferably, performing word segmentation and part-of-speech tagging on the text to be processed comprises:

performing word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.

Preferably, performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed comprises:

performing dependency parsing on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.

Preferably, removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed comprises:

for the dependency syntax tree of each sentence in the text to be processed, removing the stop words in it according to the same principle, and connecting the nodes that remain after the stop words are removed;

transferring all the dependency relations represented by the edges before stop word removal to the newly generated edge, and setting the corresponding dependency relation importance to the average of the importances of all dependency relations on the newly generated edge.

Preferably, performing hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network comprises:

calculating, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, where the dependency frequency of a node is the sum of the importances of the nodes identical to that node in all dependency syntax trees to be analyzed of the text to be processed, and the dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that appear in all dependency syntax trees to be analyzed of the text to be processed, identical edges being edges that connect the same nodes;

calculating, from the small-world network, the network-related features of each node and each edge in the small-world network, the network-related features comprising dependency degree and/or betweenness centrality, where the dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the small-world network, the dependency degree of an edge is the sum of the dependency degrees of the two nodes the edge connects, and the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;

calculating the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features.

Preferably, obtaining the hot information in the text to be processed according to the hotspot analysis result comprises:

selecting the phrases represented by the nodes or edges whose importance score is greater than a set threshold as the hot information in the text to be processed; or

selecting, in descending order of importance score, the phrases represented by a set number of nodes or edges as the hot information in the text to be processed.

A hot information discovery system, comprising:

a text obtaining module, configured to obtain a text to be processed;

a preprocessing module, configured to perform word segmentation and part-of-speech tagging on the text to be processed;

a syntactic analysis module, configured to perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;

a sorting module, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;

a network construction module, configured to construct a small-world network from the dependency syntax trees to be analyzed;

a hotspot analysis module, configured to perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;

a hot information obtaining module, configured to obtain the hot information in the text to be processed according to the hotspot analysis result.

Preferably, the preprocessing module performs word segmentation and part-of-speech tagging on the text to be processed using a method based on conditional random fields.

Preferably, the syntactic analysis module performs dependency parsing on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.

Preferably, the sorting module is specifically configured to: for the dependency syntax tree of each sentence in the text to be processed, remove the stop words in it according to the same principle and connect the nodes that remain after the stop words are removed; and transfer all the dependency relations represented by the edges before stop word removal to the newly generated edge, setting the corresponding dependency relation importance to the average of the importances of all dependency relations on the newly generated edge.

Preferably, the hotspot analysis module comprises a dependency frequency calculation module, a feature calculation module and an importance score calculation module; the feature calculation module comprises a dependency degree calculation module and/or a betweenness centrality calculation module;

the dependency frequency calculation module is configured to calculate, from the dependency syntax trees to be analyzed, the dependency frequency of each node and each edge in those trees, where the dependency frequency of a node is the sum of the importances of the nodes identical to that node in all dependency syntax trees to be analyzed of the text to be processed, and the dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that appear in all dependency syntax trees to be analyzed of the text to be processed, identical edges being edges that connect the same nodes;

the dependency degree calculation module is configured to calculate, from the small-world network, the dependency degree of each node and each edge in the small-world network, where the dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes the edge connects;

the betweenness centrality calculation module is configured to calculate, from the small-world network, the betweenness centrality of each node and each edge in the small-world network, where the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;

the importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, the network-related features comprising the dependency degree and/or the betweenness centrality.

Preferably, the hot information obtaining module is specifically configured to select the phrases represented by the nodes or edges whose importance score is greater than a set threshold as the hot information in the text to be processed, or to select, in descending order of importance score, the phrases represented by a set number of nodes or edges as the hot information in the text to be processed.

The hot information discovery method and system provided by the embodiments of the present invention construct the small-world network on the basis of dependency parsing, which better preserves the semantic information of the text. After the network is constructed, the network-related features are calculated and ranked, hotspot analysis is performed on the ranked results, and the information on the hot words in the text to be processed is obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which effectively speeds up the user's reading of the text and saves reading time.

Brief description of the drawings

In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some of the embodiments recorded in the present invention, and those of ordinary skill in the art can obtain other drawings based on them.

Fig. 1 is a flow chart of a hot information discovery method according to an embodiment of the present invention;

Fig. 2 is dependency syntax tree example one before stop word removal in an embodiment of the present invention;

Fig. 3 is dependency syntax tree example one after stop word removal in an embodiment of the present invention;

Fig. 4 is dependency syntax tree example two in an embodiment of the present invention;

Fig. 5 is dependency syntax tree example three in an embodiment of the present invention;

Fig. 6 is an example of part of the small-world network constructed in an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a hot information discovery system according to an embodiment of the present invention.

Detailed description of the embodiments

In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

Fig. 1 is a flow chart of a hot information discovery method according to an embodiment of the present invention, which comprises the following steps:

Step 101, obtain a text to be processed.

Step 102, perform word segmentation and part-of-speech tagging on the text to be processed.

For example, word segmentation and part-of-speech tagging can be performed on the text to be processed using a method based on conditional random fields. Of course, other methods can also be used; for example, word segmentation can use longest-word matching, and part-of-speech tagging can use a method based on an HMM (Hidden Markov Model).
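As a rough illustration of the longest-word-matching alternative mentioned above, the following Python sketch performs forward maximum matching against a word list. The function name `forward_max_match`, the `max_len` parameter and the tiny vocabulary are hypothetical and chosen only for this example; a practical segmenter would rely on a full dictionary or on the conditional-random-field method.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Split text greedily, always taking the longest vocabulary word first."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop always advances.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical vocabulary, for illustration only.
vocabulary = {"小世界", "网络", "一种", "结构"}
print(forward_max_match("小世界网络是一种结构", vocabulary))
# ['小世界', '网络', '是', '一种', '结构']
```

Characters that are not covered by any vocabulary word fall through as single-character words, which is the usual fallback for this kind of matcher.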

Step 103, perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed.

For example, dependency parsing can be performed on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.

For example, one sentence in the text to be processed is "The small-world network is a special kind of complex network structure", and its dependency syntax tree is shown in Fig. 2. The letter abbreviations on the edges are dependency relations, and each kind of dependency relation is assigned an importance, as shown in Table 1 below.

Table 1:

Dependency relation	Importance
Attributive relation ATT	1.0
Subject-verb relation SBV	1.0
Verb-object relation VOB	1.0
Quantity relation QUN	0.9
"的" (de) structure DE	0.5

Step 104, remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed.

The stop words are words that carry no meaning in the text to be processed, such as "this", "is" and "uh".

The stop words are removed from all dependency syntax trees of the text to be processed according to the same principle. For example, the nodes that remain after a stop word is removed can be connected on the principle that the right node depends on the left node, or alternatively on the principle that the left node depends on the right node. In addition, all the dependency relations represented by the edges before stop word removal can be transferred to the newly generated edge, whose dependency relation importance is the average of the importances of all dependency relations on it. A single representative dependency relation importance can also be chosen as the importance of the newly generated edge; the embodiments of the present invention place no limitation on this. Fig. 3 shows the dependency syntax tree, after stop word removal, of the sentence "The small-world network is a special kind of complex network structure" in the text to be processed. Before stop word removal, the "network" node and the "structure" node are connected to the "is" node by two dependency relations, i.e., SBV and VOB (see Fig. 2). After the stop word is removed, both dependency relations are transferred to the newly generated edge, whose dependency relation importance is the average of the importances of the two relations.
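The stop word removal described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the edge representation `(head, dependent, [relations])`, the function names, the `position` mapping and the rule used when the stop word is not the root (its dependents move up to its head) are hypothetical choices; the text itself only fixes the "right node depends on left node" option and the averaging of importances.

```python
# Importance of each dependency relation, as listed in Table 1.
RELATION_IMPORTANCE = {"ATT": 1.0, "SBV": 1.0, "VOB": 1.0, "QUN": 0.9, "DE": 0.5}


def remove_stop_word(edges, stop_word, position):
    """Remove one stop-word node and reconnect its neighbours.

    edges: list of (head, dependent, [relations]) tuples for one sentence.
    position: dict mapping each word to its index in the sentence.
    """
    kept = [e for e in edges if stop_word not in (e[0], e[1])]
    heads = [(h, rels) for h, d, rels in edges if d == stop_word]
    dependents = [(d, rels) for h, d, rels in edges if h == stop_word]
    if heads:
        # Assumption: dependents of a non-root stop word move up to its head.
        head, head_rels = heads[0]
        for dep, rels in dependents:
            kept.append((head, dep, rels + head_rels))
    elif dependents:
        # Root stop word: the right nodes depend on the leftmost node.
        dependents.sort(key=lambda item: position[item[0]])
        new_head, head_rels = dependents[0]
        for dep, rels in dependents[1:]:
            kept.append((new_head, dep, head_rels + rels))
    return kept


def edge_importance(relations):
    """Average importance of all relations carried by one edge."""
    return sum(RELATION_IMPORTANCE[r] for r in relations) / len(relations)


# Simplified fragment of the example sentence (not the full tree of Fig. 2).
edges = [
    ("is", "network", ["SBV"]),
    ("is", "structure", ["VOB"]),
    ("network", "small-world", ["ATT"]),
]
position = {"small-world": 0, "network": 1, "is": 2, "structure": 3}
pruned = remove_stop_word(edges, "is", position)
print(pruned[-1], edge_importance(pruned[-1][2]))
# ('network', 'structure', ['SBV', 'VOB']) 1.0
```

The new edge keeps both relation labels, so either the average or a single representative importance can be derived from it later.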

Step 105, construct a small-world network from the dependency syntax trees to be analyzed.

The small-world network is constructed from the dependency syntax trees of the sentences after stop word removal. The detailed process is as follows:

1) initialize an empty network G = (V, E), where V denotes the set of nodes and E denotes the set of edges;

2) obtain, in turn, the dependency syntax tree of each sentence in the text to be processed after stop word removal;

3) traverse each dependency syntax tree from its root node, depth-first or breadth-first;

4) when a node is traversed, determine whether the current node exists in the set V; if it does, continue to the next node; if it does not, add the current node to V;

5) when an edge is traversed, determine whether the current edge exists in the set E; if it does, continue to the next edge; if it does not, add the current edge to E;

6) determine whether all dependency syntax trees of the text to be processed have been traversed; if so, go to step 7), otherwise go to step 2);

7) all dependency syntax trees in the text to be processed have been traversed, and the small-world network G = (V, E) is obtained.
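Steps 1) to 7) above can be sketched in Python as follows. The tree representation (a root word plus a dictionary from each head to its dependents), the function name `build_network` and the use of unordered node pairs for edges are assumptions made for this illustration; the two small trees are hypothetical and do not reproduce Figs. 3 to 5.

```python
def build_network(trees):
    """Merge per-sentence dependency trees into one network G = (V, E).

    trees: list of (root, children) pairs, where children maps a head word
    to the list of words that depend on it directly.
    """
    nodes, edges = set(), set()                  # step 1: empty network
    for root, children in trees:                 # steps 2 and 6: each tree in turn
        stack = [root]                           # step 3: depth-first from the root
        while stack:
            head = stack.pop()
            nodes.add(head)                      # step 4: a set ignores duplicates
            for dep in children.get(head, []):
                edges.add(frozenset((head, dep)))  # step 5: same for edges
                stack.append(dep)
    return nodes, edges                          # step 7


# Two hypothetical trees that share the word "network".
trees = [
    ("network", {"network": ["small-world", "structure"]}),
    ("node", {"node": ["network", "most"]}),
]
nodes, edges = build_network(trees)
print(len(nodes), len(edges))  # 5 4
```

Because V and E are sets, the explicit "already present?" checks of steps 4) and 5) reduce to plain `add` calls.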

Fig. 4 shows the dependency syntax tree, after stop word removal, of the second sentence in the text to be processed, "In such a network most nodes are not connected to each other", and Fig. 5 shows the dependency syntax tree, after stop word removal, of the third sentence, "but most nodes can reach one another in just a few steps".

The small-world network constructed from all the dependency syntax trees of the text to be processed after stop word removal is shown in Fig. 6, which shows part of the small-world network corresponding to the text to be processed.

Step 106, perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network.

Specifically, the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed can be calculated from those trees, and the network-related features of each node and each edge in the small-world network can be calculated from the small-world network, the network-related features comprising dependency degree and/or betweenness centrality. The importance score of each node and/or edge in the small-world network is then calculated according to the dependency frequency and the network-related features.

The concepts of dependency frequency, dependency degree and betweenness centrality, and the way each is calculated, are described in detail below.

1) Calculate the dependency frequency of each node and each edge in the dependency syntax trees to be analyzed.

The more phrases depend on the current word, the higher the dependency frequency of that word; a phrase is represented by a node in the dependency syntax tree.

The dependency frequency of a node is the sum of the importances of the nodes identical to the current node in all dependency syntax trees to be analyzed of the text to be processed. The importance of a node in one tree is calculated as the square root of the number of nodes that depend on it directly or indirectly. In Fig. 3, 2 nodes depend directly on the "network" node and 4 nodes depend on it indirectly, 6 dependent nodes in total, so the importance of the "network" node in the dependency syntax tree of Fig. 3 is the square root of 6, i.e., 2.45. Similarly, the importance of the "network" node in the dependency syntax tree of Fig. 4 is 1. If the word "network" appears only these two times in the text to be processed, the dependency frequency of the "network" node is 2.45 + 1 = 3.45. The specific calculation is shown in formula (1):

NDDeg_i = \sum_{j=1}^{V_i} \sqrt{Npro_j}    (1)

where NDDeg_i denotes the dependency frequency of the i-th node, V_i denotes the number of nodes identical to the i-th node, and Npro_j is the number of all nodes that depend directly or indirectly on the j-th of these nodes.
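The node calculation can be sketched in Python as below. The representation of a tree as a dictionary from each head to its direct dependents, the function names, and the assumption that a word occurs at most once per tree are all hypothetical; the two trees are made up so that "network" has 6 and 1 dependent nodes respectively, matching the counts in the example rather than the actual figures.

```python
import math
from collections import defaultdict


def count_dependents(children, node):
    """Number of nodes that depend on `node` directly or indirectly."""
    return sum(1 + count_dependents(children, c) for c in children.get(node, []))


def node_dependency_frequency(trees):
    """Sum, over all trees, the square root of each word's dependent count."""
    freq = defaultdict(float)
    for children in trees:
        words = set(children) | {d for deps in children.values() for d in deps}
        for word in words:
            freq[word] += math.sqrt(count_dependents(children, word))
    return dict(freq)


# Hypothetical trees: "network" has 2 direct + 4 indirect dependents in the
# first tree and 1 dependent in the second.
tree_a = {"network": ["small-world", "structure"],
          "structure": ["a", "kind", "special", "complex"]}
tree_b = {"node": ["network"], "network": ["such"]}
freq = node_dependency_frequency([tree_a, tree_b])
print(round(freq["network"], 2))  # 3.45
```

Leaf words have no dependents, so they contribute a frequency of 0 in this sketch.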

The dependency frequency of an edge is the sum of the dependency relation importances of all edges identical to the current edge that appear in all dependency syntax trees to be analyzed of the text to be processed, identical edges being edges that connect the same nodes. For example, the dependency relation of the edge between "small-world" and "network" in Fig. 3 is ATT, with a corresponding importance of 1.0. If another edge between "small-world" and "network" also appears in the whole network, with dependency relation LAD and a corresponding importance of 0.6, then the dependency frequency of the edge between "small-world" and "network" is 1.6. The specific calculation is shown in formula (2):

EDDeg_k = \sum_{e=1}^{E_k} IDeg_e    (2)

where EDDeg_k denotes the dependency frequency of the k-th edge, E_k denotes the number of edges identical to the k-th edge, and IDeg_e denotes the dependency relation importance of the e-th of these edges.
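A minimal Python sketch of the edge calculation follows. It assumes, for illustration only, that all tree edges are supplied as `(node_a, node_b, importance)` triples and that edges are compared as unordered node pairs; the function name is hypothetical.

```python
from collections import defaultdict


def edge_dependency_frequency(tree_edges):
    """Sum the relation importance of all edges that join the same two nodes."""
    freq = defaultdict(float)
    for a, b, importance in tree_edges:
        freq[frozenset((a, b))] += importance
    return dict(freq)


# The worked example: one ATT edge (1.0) and one LAD edge (0.6).
tree_edges = [
    ("small-world", "network", 1.0),
    ("network", "small-world", 0.6),
]
freq = edge_dependency_frequency(tree_edges)
print(round(freq[frozenset(("small-world", "network"))], 2))  # 1.6
```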

2) Calculate the dependency degree of each node and each edge in the network from the small-world network.

The dependency degree of each node and each edge in the network is calculated from the importance of each kind of dependency relation.

The dependency degree of a node is the sum of the dependency relation importances of the edges connected to that node in the network. In Fig. 3, the "network" node is connected to 2 edges. The dependency relation of the first edge is ATT, with a corresponding importance of 1.0; the dependency relation of the second edge is SBV-VOB, whose importance is the average of the SBV and VOB importances, i.e., 1.0. The dependency degree of the "network" node is therefore 2.0, as shown in formula (3):

NIDeg_i = \sum_{k=1}^{N_i} IDeg_k    (3)

where NIDeg_i denotes the dependency degree of the i-th node, N_i denotes the number of edges connected to the i-th node, and IDeg_k denotes the dependency relation importance of the k-th of these edges.

The dependency degree of an edge is the sum of the dependency degrees of the two nodes it connects. In Fig. 3, the dependency degree of the edge between "small-world" and "network" is the sum of the dependency degrees of the "small-world" node and the "network" node, calculated as shown in formula (4):

EIDeg_k = NIDeg_{i1} + NIDeg_{i2}    (4)

where EIDeg_k denotes the dependency degree of the k-th edge, and NIDeg_{i1} and NIDeg_{i2} denote the dependency degrees of the two nodes i1 and i2 connected by the k-th edge.
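The two dependency degree calculations can be sketched together in Python. The dictionary from node pairs to relation importance and the function name `dependency_degrees` are assumptions for this illustration; the two-edge network is a fragment built from the Fig. 3 example values.

```python
def dependency_degrees(network_edges):
    """Return (node degree, edge degree) for a network.

    network_edges: dict mapping (node_a, node_b) to the dependency relation
    importance of that edge.
    """
    node_degree = {}
    for (a, b), importance in network_edges.items():
        # A node's degree is the sum of the importance of its incident edges.
        node_degree[a] = node_degree.get(a, 0.0) + importance
        node_degree[b] = node_degree.get(b, 0.0) + importance
    # An edge's degree is the sum of the degrees of its two end nodes.
    edge_degree = {(a, b): node_degree[a] + node_degree[b] for a, b in network_edges}
    return node_degree, edge_degree


network_edges = {
    ("small-world", "network"): 1.0,   # ATT
    ("network", "structure"): 1.0,     # average of SBV and VOB
}
node_degree, edge_degree = dependency_degrees(network_edges)
print(node_degree["network"], edge_degree[("small-world", "network")])  # 2.0 3.0
```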

3) Calculate the betweenness centrality of each node or edge in the network from the small-world network.

The betweenness centrality is the number of times a node or edge appears on the shortest paths between any two other nodes in the network. For example, in Fig. 3, the shortest path between the "small-world" node and the "structure" node is "small-world, network, structure", with a shortest path length of 2. The "network" node appears on the shortest path between "small-world" and "structure", so the betweenness centrality of the "network" node is 1; if the "network" node also appears on the shortest path between another pair of nodes, its betweenness centrality is 2. The edge between "small-world" and "network" also lies on this shortest path; if the edge does not appear on the shortest path between any other nodes, its betweenness centrality is 1. When calculating shortest paths, the distance between adjacent nodes can be measured in the conventional way, i.e., as 1, or as the reciprocal of the dependency frequency of the edge between the two nodes. If the dependency frequency of the edge between "small-world" and "network" in Fig. 3 is assumed to be 1.6, the distance between the "small-world" node and the "network" node is the reciprocal of 1.6, i.e., 0.625.
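The counting described above can be sketched in Python for the unit-distance case. Several choices here are assumptions made for illustration: one shortest path per node pair is found by breadth-first search (ties are broken arbitrarily), only paths with at least one intermediate node are counted so that the result matches the worked example, and the function names are hypothetical. This is a plain count, not the fractional betweenness centrality computed by standard graph libraries.

```python
from collections import deque
from itertools import combinations


def shortest_path(adj, src, dst):
    """One shortest path from src to dst by breadth-first search, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None


def betweenness_counts(adj):
    """Count how often each node and edge lies on a path between two other nodes."""
    node_count = {n: 0 for n in adj}
    edge_count = {}
    for s, t in combinations(sorted(adj), 2):
        path = shortest_path(adj, s, t)
        if path is None or len(path) < 3:
            continue  # assumption: directly connected pairs are not counted
        for n in path[1:-1]:
            node_count[n] += 1
        for a, b in zip(path, path[1:]):
            key = frozenset((a, b))
            edge_count[key] = edge_count.get(key, 0) + 1
    return node_count, edge_count


adj = {
    "small-world": ["network"],
    "network": ["small-world", "structure"],
    "structure": ["network"],
}
node_count, edge_count = betweenness_counts(adj)
print(node_count["network"], edge_count[frozenset(("small-world", "network"))])  # 1 1
```

To use the reciprocal of the edge dependency frequency as the distance instead, the breadth-first search would have to be replaced by a weighted shortest-path search such as Dijkstra's algorithm.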

After the dependency frequency and the dependency degree and/or betweenness centrality have been calculated, these features can be combined to determine the importance score of each node and/or edge in the small-world network. It should be noted that, in practical applications, the importance score of each node and/or edge in the small-world network can be calculated using all three features together, using the dependency frequency and the dependency degree, or using the dependency frequency and the betweenness centrality; the embodiments of the present invention place no limitation on this.

The calculation of the importance score of each node and/or edge in the small-world network using all three features together is described below as an example.

The three features above are taken as a three-dimensional feature of each word node in the text to be processed. Because the value range of each feature dimension is different, the features cannot be used directly, so the values of each dimension are first normalized. The normalization can be done by rank scoring, or by another method, such as dividing each feature value in a dimension by the sum of the feature values in that dimension to obtain the normalized feature value.

Taking rank scoring as an example, the values of each feature dimension are sorted in ascending order, and the index of a value after sorting is used as its score. For example, if the dependency frequency of the "network" node is 2.45, its dependency degree is 2.0 and its betweenness centrality is 2, and the indices of these values after sorting are 3, 6 and 10 respectively, then the three-dimensional feature scores of the node are 3, 6 and 10.
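Both normalization options can be sketched in a few lines of Python. The function names are hypothetical, the ranks are assumed to start at 1 (the text does not say whether indices start at 0 or 1), and tied values simply receive consecutive ranks in input order; the feature values below are made up for the example.

```python
def rank_scores(values):
    """Score each item by the 1-based position of its value in ascending order."""
    ordered = sorted(values, key=values.get)
    return {item: rank for rank, item in enumerate(ordered, start=1)}


def sum_normalize(values):
    """Divide each value by the sum of all values in the same feature dimension."""
    total = sum(values.values())
    return {item: value / total for item, value in values.items()}


# Hypothetical dependency frequencies for three nodes.
dependency_frequency = {"network": 2.45, "node": 3.1, "structure": 1.0}
print(rank_scores(dependency_frequency))
# {'structure': 1, 'network': 2, 'node': 3}
print(sum_normalize({"a": 1.0, "b": 3.0}))
# {'a': 0.25, 'b': 0.75}
```

Rank scoring discards the size of the gaps between values, which makes the three feature dimensions directly comparable regardless of their original ranges.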

Using the normalized three-dimensional feature scores, the importance score of each node and/or edge in the network can be calculated, as shown in formula (5):

where FScore_i is the importance score of the i-th node or edge, Score_ij is the score of the j-th feature dimension of the i-th node or edge, and R is the number of feature dimensions of each node or edge, e.g., 3.
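Formula (5) itself is not reproduced in this text. The Python sketch below therefore assumes, purely for illustration, that the R normalized feature scores of an item are combined by a plain sum; a weighted sum or an average would fit the symbol definitions equally well. The function name and the feature scores of the "node" item are hypothetical.

```python
def importance_scores(feature_scores):
    """Combine the R per-feature scores of each node or edge.

    Assumption: the scores are simply added; the source does not confirm this.
    """
    return {item: sum(scores) for item, scores in feature_scores.items()}


feature_scores = {
    "network": [3, 6, 10],   # rank scores from the example above
    "node": [2, 5, 7],       # hypothetical values
}
print(importance_scores(feature_scores))  # {'network': 19, 'node': 14}
```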

Step 107, obtain the hot information in the text to be processed according to the hotspot analysis result.

Specifically, the phrases represented by the nodes or edges whose importance score is greater than a set threshold can be selected as the hot information in the text to be processed; alternatively, the phrases represented by a set number (e.g., 10) of nodes or edges can be selected in descending order of importance score as the hot information in the text to be processed. In Fig. 6, the three groups of hot information obtained are: network-node, network-structure, node-most.
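The two selection strategies can be sketched in Python as follows. The function names and the score values are hypothetical and chosen only so that the three groups named above come out on top.

```python
def select_by_threshold(scores, threshold):
    """Keep every item whose importance score is greater than the threshold."""
    return [item for item, score in scores.items() if score > threshold]


def select_top_n(scores, n):
    """Keep the n items with the highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical importance scores for four edges.
scores = {"network-node": 27, "network-structure": 25, "node-most": 22, "step-several": 9}
print(select_top_n(scores, 3))
# ['network-node', 'network-structure', 'node-most']
print(select_by_threshold(scores, 20))
# ['network-node', 'network-structure', 'node-most']
```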

The hot information discovery method of the embodiments of the present invention constructs the small-world network on the basis of dependency parsing, which better preserves the semantic information of the text. After the network is constructed, the network-related features are calculated and ranked, hotspot analysis is performed on the ranked results, and the information on the hot words in the text to be processed is obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which effectively speeds up the user's reading of the text and saves reading time.

Correspondingly, the embodiments of the present invention also provide a hot information discovery system. Fig. 7 is a schematic structural diagram of the system.

In this embodiment, the system comprises:

a text obtaining module 701, configured to obtain a text to be processed;

a preprocessing module 702, configured to perform word segmentation and part-of-speech tagging on the text to be processed;

a syntactic analysis module 703, configured to perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;

a sorting module 704, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain dependency syntax trees to be analyzed;

a network construction module 705, configured to construct a small-world network from the dependency syntax trees to be analyzed;

a hotspot analysis module 706, configured to perform hotspot analysis according to the dependency syntax trees to be analyzed and the small-world network;

a hot information obtaining module 707, configured to obtain the hot information in the text to be processed according to the hotspot analysis result.

Above-mentioned preprocessing module 702 can segment the text to be processed using the method based on condition random field And part-of-speech tagging.Above-mentioned syntactic analysis module 703 can use maximum spanning tree algorithm or method pair neural network based Text after participle carries out interdependent syntactic analysis, obtains the interdependent syntax tree of every words in the text to be processed.Certainly, this two A module can also complete participle, part-of-speech tagging and the process of syntactic analysis using other methods, to this embodiment of the present invention Without limitation.

It should be noted that, for the dependency syntax tree of each sentence in the text to be processed, the sorting module 704 removes the stop words according to the same principle and connects the nodes that remain after the stop words are removed. For example, the remaining nodes may be connected according to the principle that the right node depends on the left node, or according to the principle that the left node depends on the right node. In addition, the dependency relations represented by the edges that existed before the stop words were removed are all transferred onto the newly generated edge. The dependency-relation importance of the newly generated edge may further be set to the average of the importances of all the dependency relations on that edge; of course, a representative dependency-relation importance may instead be selected as the dependency-relation importance of the newly generated edge.
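The stop-word removal described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the token layout (`word`, `head`, `rel`, `weight`), the function name `remove_stop_words`, and the rule of reattaching a node to its nearest ancestor that is not a stop word are hypothetical choices, not the connection principle prescribed by the embodiment. The new edge keeps every relation on the collapsed path and takes the average of their importances.

```python
from statistics import mean


def remove_stop_words(tokens, stop_words):
    """Drop stop-word nodes from one dependency tree and reconnect the rest.

    tokens: list of dicts with keys
      "word"   - the word itself
      "head"   - index of the head token, or None for the root
      "rel"    - dependency relation label on the edge to the head
      "weight" - importance of that dependency relation
    Returns a list of edges (head_word, dependent_word, relations, importance).
    """
    edges = []
    for tok in tokens:
        if tok["word"] in stop_words or tok["head"] is None:
            continue  # stop words are removed; the root has no incoming edge
        rels, weights = [tok["rel"]], [tok["weight"]]
        head = tok["head"]
        # Assumption: climb past stop-word ancestors, carrying their relations
        # onto the newly generated edge.
        while head is not None and tokens[head]["word"] in stop_words:
            rels.append(tokens[head]["rel"])
            weights.append(tokens[head]["weight"])
            head = tokens[head]["head"]
        if head is None:
            continue  # every ancestor was a stop word; the node becomes a root
        # Importance of the new edge: average over all transferred relations.
        edges.append((tokens[head]["word"], tok["word"], rels, mean(weights)))
    return edges
```

Selecting a single representative importance instead of the average, as the text also allows, would only change the last argument of the appended tuple.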

In practical applications, the hotspot analysis module 706 may perform hotspot analysis by calculating the importance score of each node and/or edge in the small-world network. One specific structure of the module comprises: a dependency frequency calculation module, a feature calculation module and an importance score calculation module; the feature calculation module comprises a dependency degree calculation module and/or a betweenness centrality calculation module. Wherein:

Correspondingly, the hot information acquisition module 707 may select, as the hot information in the text to be processed, the phrases represented by the nodes or connected edges whose importance score is greater than a set threshold; or it may select, in descending order of importance score, a set number of nodes or connected edges and take the phrases they represent as the hot information in the text to be processed.
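The two selection strategies just described can be sketched as follows. The function name `select_hot_items` and the representation of the scores as a dictionary from phrase to importance score are assumptions made for illustration only.

```python
def select_hot_items(scores, threshold=None, top_n=None):
    """Pick hot phrases either by a score threshold or by a fixed count.

    scores: dict mapping a phrase (the word of a node, or the phrase formed
            by a connected edge) to its importance score.
    Exactly one of `threshold` and `top_n` must be given.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("specify exactly one of threshold or top_n")
    # Rank phrases from the highest importance score to the lowest.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Strategy 1: keep everything scoring above the set threshold.
        return [phrase for phrase, score in ranked if score > threshold]
    # Strategy 2: keep the set number of highest-scoring items.
    return [phrase for phrase, _ in ranked[:top_n]]
```

Both strategies return the phrases ordered by descending score, so the caller can present the most important items first.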

The hot information discovery system of the embodiment of the present invention constructs the small-world network on the basis of dependency syntactic analysis, which better preserves the semantic information of the text. After the network has been constructed, the network-related features are calculated and ranked, hotspot analysis is performed on the ranked results, and the hot-word information in the text to be processed is obtained from the hotspot analysis result. The hot information of the text to be processed can thus be analyzed efficiently and accurately, which effectively speeds up the user's reading of the text and saves reading time.

It should be noted that the hot information discovery method and system of the embodiments of the present invention can be applied to fields such as natural language processing, information search and information processing, and can efficiently and accurately obtain information on the hot words that play an important role in the text to be processed.

The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and the relevant points can be found in the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The embodiments of the present invention have been described in detail above, and specific examples are used herein to illustrate the invention; the description of the above embodiments is intended only to help in understanding the method and system of the invention. Meanwhile, those of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims

1. A hot information discovery method, characterized by comprising:

acquiring a text to be processed;

performing hotspot analysis according to the dependency syntax tree to be analyzed and the small-world network, including separately calculating the dependency frequency of each element in the dependency syntax tree to be analyzed and the network-related features of each element in the small-world network, and performing hotspot analysis according to the dependency frequency and the network-related features;

2. The method according to claim 1, characterized in that performing word segmentation and part-of-speech tagging on the text to be processed comprises:

3. The method according to claim 1, characterized in that performing syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed comprises:

performing dependency syntactic analysis on the segmented text using a maximum spanning tree algorithm or a neural-network-based method, to obtain the dependency syntax tree of each sentence in the text to be processed.

4. The method according to claim 1, characterized in that removing the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax tree to be analyzed comprises:

for the dependency syntax tree of each sentence in the text to be processed, removing the stop words according to the same principle, and connecting the nodes that remain after the stop words are removed;

transferring all the dependency relations represented by the edges that existed before the stop words were removed onto the newly generated edge, and setting the corresponding dependency-relation importance to the average of the importances of all the dependency relations on the newly generated edge.

5. The method according to any one of claims 1 to 4, characterized in that

calculating the dependency frequency of each element in the dependency syntax tree to be analyzed comprises:

calculating, according to the dependency syntax tree to be analyzed, the dependency frequency of each node and each edge in the dependency syntax tree to be analyzed, wherein the dependency frequency of a node is the sum of the importances of the nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed, and the dependency frequency of an edge is the sum of the dependency-relation importances of all edges identical to the current edge that occur in all the dependency syntax trees to be analyzed of the text to be processed, identical edges being edges that connect the same nodes;

calculating the network-related features of each element in the small-world network comprises:

calculating, according to the small-world network, the network-related features of each node and each edge in the small-world network, the network-related features comprising dependency degree and/or betweenness centrality, wherein the dependency degree of a node is the sum of the dependency-relation importances of the edges connected to that node in the small-world network, the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects, and the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;

performing hotspot analysis according to the dependency frequency and the network-related features comprises:

calculating the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features.

6. The method according to claim 5, characterized in that obtaining the hot information in the text to be processed according to the hotspot analysis result comprises:

selecting, as the hot information in the text to be processed, the phrases represented by the nodes or connected edges whose importance score is greater than a set threshold; or

selecting, in descending order of importance score, a set number of nodes or connected edges, and taking the phrases they represent as the hot information in the text to be processed.

7. A hot information discovery system, characterized by comprising:

a text acquisition module, configured to acquire a text to be processed;

a syntactic analysis module, configured to perform syntactic analysis on the segmented text to obtain the dependency syntax tree of each sentence in the text to be processed;

a sorting module, configured to remove the stop words from the dependency syntax tree of each sentence in the text to be processed to obtain the dependency syntax tree to be analyzed;

a hotspot analysis module, configured to perform hotspot analysis according to the dependency syntax tree to be analyzed and the small-world network, including separately calculating the dependency frequency of each element in the dependency syntax tree to be analyzed and the network-related features of each element in the small-world network, and performing hotspot analysis according to the dependency frequency and the network-related features;

8. The system according to claim 7, characterized in that the preprocessing module uses a method based on conditional random fields to perform word segmentation and part-of-speech tagging on the text to be processed.

9. The system according to claim 7, characterized in that the syntactic analysis module uses a maximum spanning tree algorithm or a neural-network-based method to perform dependency syntactic analysis on the segmented text, obtaining the dependency syntax tree of each sentence in the text to be processed.

10. The system according to claim 7, characterized in that

the sorting module is specifically configured to: for the dependency syntax tree of each sentence in the text to be processed, remove the stop words according to the same principle and connect the nodes that remain after the stop words are removed; and transfer all the dependency relations represented by the edges that existed before the stop words were removed onto the newly generated edge, and set the corresponding dependency-relation importance to the average of the importances of all the dependency relations on the newly generated edge.

11. The system according to any one of claims 7 to 10, characterized in that the hotspot analysis module comprises: a dependency frequency calculation module, a feature calculation module and an importance score calculation module; the feature calculation module comprises a dependency degree calculation module and/or a betweenness centrality calculation module;

the dependency frequency calculation module is configured to calculate, according to the dependency syntax tree to be analyzed, the dependency frequency of each node and each edge in the dependency syntax tree to be analyzed, wherein the dependency frequency of a node is the sum of the importances of the nodes identical to that node in all the dependency syntax trees to be analyzed of the text to be processed, and the dependency frequency of an edge is the sum of the dependency-relation importances of all edges identical to the current edge that occur in all the dependency syntax trees to be analyzed of the text to be processed, identical edges being edges that connect the same nodes;

the dependency degree calculation module is configured to calculate, according to the small-world network, the dependency degree of each node and each edge in the small-world network, wherein the dependency degree of a node is the sum of the dependency-relation importances of the edges connected to that node in the small-world network, and the dependency degree of an edge is the sum of the dependency degrees of the two nodes that the edge connects;

the betweenness centrality calculation module is configured to calculate, according to the small-world network, the betweenness centrality of each node and each edge in the small-world network, wherein the betweenness centrality is the number of times the node or edge appears on the shortest paths between any two other nodes in the small-world network;

the importance score calculation module is configured to calculate the importance score of each node and/or edge in the small-world network according to the dependency frequency and the network-related features, the network-related features comprising the dependency degree and/or the betweenness centrality.

12. The system according to claim 11, characterized in that

the hot information acquisition module is specifically configured to select, as the hot information in the text to be processed, the phrases represented by the nodes or connected edges whose importance score is greater than a set threshold; or to select, in descending order of importance score, a set number of nodes or connected edges and take the phrases they represent as the hot information in the text to be processed.