CN102693314B - A kind of sensitive information monitoring method based on event searching - Google Patents

Info

Publication number
CN102693314B
CN102693314B CN201210170863.4A CN201210170863A CN102693314B CN 102693314 B CN102693314 B CN 102693314B CN 201210170863 A CN201210170863 A CN 201210170863A CN 102693314 B CN102693314 B CN 102693314B
Authority
CN
China
Prior art keywords
event
predicate
feature
word
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210170863.4A
Other languages
Chinese (zh)
Other versions
CN102693314A (en)
Inventor
代松
姬东鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN HUAAN SCIENCE AND TECHNOLOGY CO., LTD.
Original Assignee
代松
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 代松
Priority to CN201210170863.4A
Publication of CN102693314A
Application granted
Publication of CN102693314B

Abstract

The invention belongs to the field of natural language processing and provides a sensitive information monitoring method based on event search, comprising the steps of event recognition, event extraction and early-warning issuance. The method uses an event-based search mechanism to monitor a given class of events automatically and intelligently and to filter out large amounts of irrelevant information, so that the results obtained have higher accuracy.

Description

A sensitive information monitoring method based on event search
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a sensitive information monitoring method based on event search.
Background art
With the arrival of the Internet information age, the ability to obtain information, and sensitive information in particular, quickly and accurately from the Internet has become an important measure of social progress. Being well informed gives one the initiative, and capturing sensitive information promptly lays a good foundation for correct decision-making. This is of great significance for national security, government operation and business activity.
First, national security is a primary task of the state, and information security is part of national security. Timely control and handling of sensitive information is a precondition for safeguarding national security and maintaining social stability. Second, government departments, as functional departments serving the people, must understand the will of the people promptly and grasp public sentiment accurately; at the same time they must keep track of the background and development of relevant sensitive events, so as to prevent negative events or minimize their impact. Third, for business, promptly grasping market feedback on products and following competitors' activities is an important means for an enterprise to stay competitive and keep improving product performance, and to some extent determines the success or failure of the enterprise.
At present, sensitive information on the Internet is mostly obtained through search engines. For example, to search for news about "which countries has Singapore's former Prime Minister, Senior Minister Lee Kuan Yew, visited", one approach is to enter the question directly into a search engine. The top 6 search results from Baidu are as follows:
1) The lackey complex as seen in "Senior Minister Lee's" remarks (page 1) - Overseas Affairs - Baiwei Forum ...
2) Asia-Pacific "political tiger" Senior Minister Lee -- related articles
3) Senior Minister Lee gives an interview to United Press International (UPI) - Xici Hutong
4) [Repost] Senior Minister Lee, interviewed by German and Japanese media: China is strengthening its armaments to weaken U.S. resolve to aid Taiwan ...
5) [∞ Jiangnan Garden ∞] Senior Minister Lee in a CNN interview: the U.S. promotion of democracy worldwide is unworkable ...
6) Senior Minister Lee: will visit Malaysia to observe Malaysian political trends in person
As the search results show, the first 5 results are irrelevant to the query and only the 6th is relevant. Results 1-2 merely mention Senior Minister Lee, while results 3-5 are about him giving interviews (the Chinese word 访问 in the query can mean both "visit" and "interview").
A second approach is to reduce "which countries has Singapore's former Prime Minister, Senior Minister Lee Kuan Yew, visited" to "Senior Minister Lee visit". The top 6 search results from Baidu are as follows:
1) Senior Minister Lee gives an interview to United Press International (UPI) - Xici Hutong
2) Those in Taiwan who insist on Taiwan independence are only a minority --- Senior Minister Lee interviewed by Singaporean and Taiwanese newspapers
3) [∞ Jiangnan Garden ∞] Senior Minister Lee in a CNN interview: the U.S. promotion of democracy worldwide is unworkable ...
4) Haixi takes off: Senior Minister Lee Kuan Yew leads a delegation to visit Fujian - Fujian Forum - Fujian Forum | and eight ...
5) Senior Minister Lee: six new members of parliament came along to learn how to deal with Chinese leaders
6) MoneyLand - China and Singapore cooperate to build an eco-city
As the search results show, only the 4th and 5th are relevant to the query. Results 1-3 are about Senior Minister Lee giving interviews. The 6th reports talks between President Hu Jintao and Senior Minister Lee, but the visit it mentions is Premier Wen Jiabao's visit to Singapore.
A third approach is to add "country" to "Senior Minister Lee visit", so as to avoid reports in which he is the one being interviewed. The top 6 search results from Baidu are as follows:
1) Senior Minister Lee: will visit Malaysia to observe Malaysian political trends in person
2) Senior Minister Lee: will visit Malaysia to observe Malaysian political trends in person
3) Senior Minister Lee gives an interview to United Press International (UPI) - Xici Hutong
4) [∞ Jiangnan Garden ∞] Senior Minister Lee in a CNN interview: the U.S. promotion of democracy worldwide is unworkable ...
5) Wu Bangguo visits Singapore and meets Senior Minister Lee Kuan Yew _ News Center _ Sina
6) China Review News, May 13, 2006 ... Singapore's Senior Minister Lee Kuan Yew, on a visit to Beijing, met Vice President Zeng Qinghong yesterday. China Review News Agency, Hong Kong, May 13: Singapore's Senior Minister Lee Kuan Yew, on a visit to Beijing, met Vice President Zeng Qinghong yesterday ...
Here results 1, 2 and 6 are correct, results 3-4 are again about Senior Minister Lee giving interviews, and the 5th reports Wu Bangguo's visit to Singapore.
Summing up the above results, several problems faced by search engines can be identified:
First, the keyword matching of a search engine cannot describe the user's search need. For example, the query "which countries has Singapore's former Prime Minister, Senior Minister Lee Kuan Yew, visited" asks about visits that Senior Minister Lee made, yet the results include reports in which he was the one being visited or interviewed.
Second, search engines do not consider the structure among keywords. In the query "which countries has Singapore's former Prime Minister, Senior Minister Lee Kuan Yew, visited", there is a definite syntactic and semantic structure between "Senior Minister Lee" and "visit", but the search engine ignores it entirely.
Third, search engines do not consider concept matching. For the query "Senior Minister Lee visit country", neither "Malaysia" nor "China" in the results is matched, which shows that the search engine does not recognize the relation between them and "country". As further evidence, if a report contains only "Lee Kuan Yew" (李光耀) and not "Senior Minister Lee" (李资政), neither the query "Senior Minister Lee visit" nor the query "Senior Minister Lee visit country" is likely to find it.
Fourth, the results of a search engine are still unstructured text intended only for human reading; a computer cannot use the results directly for further processing. For example, if a user wants to review Obama's recent state visits, including when and where they took place and whom he met, a large amount of manual information retrieval is needed.
Besides obtaining sensitive information through search engines as described above, there are also sensitive information monitoring techniques or systems based on statistics or on hand-written rules. The performance of current statistics-based techniques is still far from what practical applications require; rule-based techniques port poorly to new domains and need a large amount of manual work.
Summary of the invention
To address the above shortcomings of the background art, the present invention proposes a sensitive information monitoring method based on event search. It uses an event-based search mechanism to monitor a given class of events automatically and intelligently and to filter out large amounts of irrelevant information, so that the results obtained have higher accuracy.
The object of the invention is achieved by the following technical measures.
A sensitive information monitoring method based on event search, in which the hardware used comprises an event recognition component and an event extraction component, the method comprising the following steps:
(1) the event recognition component uses the semantic role classification scheme of PropBank to perform semantic role labeling on the monitored text, in order to describe events and their features;
(2) after events are recognized, the event extraction component uses a log-linear model to perform event extraction, including the key concepts that constitute an event; the model is used to parse the sentences of the monitored text syntactically and so obtain the events in each sentence;
(3) specified events and event-related concepts are searched for automatically in the monitored text: the sensitive event extracted from the natural language query specified by the user is matched against the events extracted from the monitored text, and if the predicate, arguments and semantic role categories are all identical, the two are the same event and the monitored text is judged to be sensitive information;
(4) once a text fragment is judged to be sensitive information, early-warning information is generated automatically and sent to the relevant users by e-mail, SMS or similar means, so that countermeasures can be taken in time.
In the above technical scheme, the semantic role labeling in step (1) uses a maximum entropy model as the classifier, and the labeling process is divided into two stages.
The first stage identifies predicates, i.e. determines which words in a sentence of the monitored text are predicates. Let $C = \{c_1, c_2, \ldots, c_{n_l}\}$ be the set of senses, where $n_l$ is the number of classes, and let $t_i$ be the $i$-th sense of word $w$ in sentence $s$. The maximum entropy model selects the value of $t$ that maximizes the conditional probability $p(w \mid s, t_i)$: $\hat{t} = \arg\max_{t_i \in C} p(w \mid s, t_i)$.
To identify predicates, the following features are used: the word itself, its dependency type, its parent, the parent's part of speech, the set of children, the set of the children's parts of speech, the set of the children's dependency types, the left neighbor, the right neighbor, the left neighbor's dependency type, and the right neighbor's dependency type;
The second stage classifies the semantic class of each predicate. Classification again uses maximum entropy as the classifier, with the same features as in predicate identification; the semantic classes of predicates follow the standard in PropBank. Semantic role labeling proceeds in the same way as predicate identification, with argument identification and argument classification merged into a single process. In addition to the features used in predicate identification, semantic role labeling uses the following features: position, i.e. to the left or right of the predicate, or the predicate itself; the predicate sense class as defined in PropBank; the leftmost and rightmost words; the leftmost and rightmost children; the part-of-speech path, i.e. all parts of speech from this word to a given predicate, ordered along the connecting path; the dependency-type path, i.e. all dependency types from this word to a given predicate, ordered along the connecting path; and the common-ancestor path, i.e. the path from this word to its common ancestor with a given predicate, comprising both the part-of-speech path and the dependency-type path.
In the above technical scheme, event extraction with the log-linear model in step (2) is carried out as follows. Let T be the set of relevance trees or feature structures and t a particular relevance tree or feature structure; an m-dimensional feature vector $f(s, t) \in \mathbb{R}^m$ represents t in s, and $\theta \in \mathbb{R}^m$ is an m-dimensional parameter vector. Then for the sentence or phrase s, the optimal relevance tree or feature structure is $t^{*} = \arg\max_{t \in T} \theta \cdot f(s, t)$, i.e. the one that maximizes the inner product of these two vectors.
Finding the optimal solution is the core of event acquisition and consists mainly of two modules, training and decoding.
The training module estimates the weights of the vector θ. Let M be the number of iterations and RT the set of relevance trees of sentence $x_i$. The training algorithm has two parts, checking correctness and updating parameters: the first step detects the difference between the training data and the algorithm's output; the second step readjusts the parameters according to this difference.
The decoding module uses Prim's algorithm to obtain the relevance tree. Let $s = s_1, s_2, \ldots, s_n$ be the input sentence, where $s_i$ is a word in s, and let $G = \langle V, E \rangle$ be an undirected graph built for s, where $V = \{s_i : 1 \le i \le n\}$ is the set of nodes, one node per word, and $E = \{(s_i, s_j) : 1 \le i, j \le n\}$ is the set of edges, with an edge between every pair of words. Under the log-linear model, the goal of parsing is to obtain the relevance tree $t^{*}$ with the maximum score. Each node in the tree t corresponds to a feature word, so the features can be decomposed over t.
The feature design uses the following features:
Binary features: for any edge <w1, w2>, the binary features comprise <w1, w2>, <w1, c2>, <c1, w2> and <c1, c2>, where c1 and c2 are the concept classes of w1 and w2, as defined for example in HowNet or a Chinese thesaurus;
Structural features: for any node w adjacent to <w1, w2>, <w, w1, w2> is a structural feature;
Contextual features: for a word w located anywhere between w1 and w2 in the sentence, or within a certain distance outside w1 and w2, <w, w1, w2> is a contextual feature.
In the above technical scheme, event search in step (3) may allow incomplete matching of events, subject to constraints. For example, the predicates of the two events must be identical; their arguments need not all be the same, but any argument of a role category that appears in both sentences must be identical. This improves the recall of event search.
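The exact and incomplete matching rules of step (3) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Event` class, the role labels `arg0`, `arg1` and `argM-TMP`, and the example events are assumptions made for the example, and arguments are compared as plain strings, so concept-level matching is outside the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical event record: one predicate plus arguments keyed by role label."""
    predicate: str
    arguments: Dict[str, str] = field(default_factory=dict)


def exact_match(query: Event, candidate: Event) -> bool:
    """Same predicate, same role labels, and the same argument for every role."""
    return (query.predicate == candidate.predicate
            and query.arguments == candidate.arguments)


def incomplete_match(query: Event, candidate: Event) -> bool:
    """Predicates must agree; only roles present in both events are compared."""
    if query.predicate != candidate.predicate:
        return False
    shared_roles = query.arguments.keys() & candidate.arguments.keys()
    return all(query.arguments[r] == candidate.arguments[r] for r in shared_roles)


# Illustrative events (made up for this example).
query = Event("visit", {"arg0": "Sarkozy", "arg1": "China"})
dated = Event("visit", {"arg0": "Sarkozy", "arg1": "China", "argM-TMP": "this week"})
other = Event("visit", {"arg0": "Sarkozy", "arg1": "India"})
```

Here `dated` carries an extra time argument, so it fails the exact test but passes the incomplete one, while `other` is rejected by both because the shared `arg1` role differs.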
Compared with the prior art, the invention has the following advantage:
Traditional public opinion monitoring systems monitor by keywords or keyword combinations: once a piece of information is found to contain a keyword or combination, an early warning is issued. However, the results obtained by keyword retrieval may contain a large amount of noise, and useful information may be filtered out by the keyword list. For example, if a user wants to follow public opinion on "Sarkozy visits China", the system may fail to detect "French president visits Beijing this week", yet may well flag "Sarkozy visits India". The present system uses an event-based search mechanism that can search for a given class of events automatically and intelligently and filter out large amounts of irrelevant information, so that the results obtained have higher accuracy.
Brief description of the drawings
Fig. 1 is an example of an event in an embodiment of the invention.
Fig. 2 is the table of core semantic roles of predicates in PropBank, as used in an embodiment of the invention.
Fig. 3 is a flowchart of event extraction using the log-linear model in an embodiment of the invention.
Fig. 4 is an architecture diagram of the method in an embodiment of the invention.
Detailed description of the embodiments
Compared with the prior art, the innovations of the invention are:
1. Event search: compared with a keyword, an event is a conceptual structure. First, it can describe the user's search need more accurately; second, it is structure-based and can therefore meet the user's precision requirement; third, it is concept-based and can therefore also meet the user's recall requirement;
2. Search results: the results of current search engines are still unstructured text, intended mainly for users to read; the results of event search are structured, so they can both be read by users and be processed further by a computer directly.
The invention is further described below with reference to the drawings and embodiments.
This embodiment provides a sensitive information monitoring method based on event search, in which the hardware used comprises an event recognition component and an event extraction component. As shown in Fig. 4, the method comprises the following steps:
(1) The event recognition component uses the semantic role classification scheme of PropBank to perform semantic role labeling on the monitored text, in order to describe events and their features.
(2) After events are recognized, the event extraction component uses a log-linear model to perform event extraction, including the key concepts that constitute an event; the model is used to parse the sentences of the monitored text syntactically and so obtain the events in each sentence.
(3) Specified events and event-related concepts are searched for automatically in the monitored text: the sensitive event extracted from the natural language query specified by the user is matched against the events extracted from the monitored text, and if the predicate, arguments and semantic role categories are all identical, the two are the same event and the monitored text is judged to be sensitive information. In addition, incomplete matching of events may be allowed, subject to constraints. For example, the predicates of the two events must be identical; their arguments need not all be the same, but any argument of a role category that appears in both sentences must be identical. This improves the recall of event search.
(4) Once a text fragment is judged to be sensitive information, early-warning information is generated automatically and sent to the relevant users by e-mail, SMS or similar means, so that countermeasures can be taken in time.
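Step (4) can be illustrated with a small sketch that only builds the e-mail form of an early warning; actual delivery (SMTP, an SMS gateway) is left out. The function name `build_alert`, the subject wording and the recipient address are hypothetical and are not specified by the source.

```python
from email.message import EmailMessage


def build_alert(text_fragment: str, event_summary: str, recipient: str) -> EmailMessage:
    """Wrap a text fragment judged sensitive into an e-mail early warning."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"Sensitive information alert: {event_summary}"
    # The body carries the monitored text so the user can decide how to respond.
    msg.set_content(text_fragment)
    return msg


# Illustrative call; the address and texts are placeholders.
alert = build_alert(
    "The governor was arrested by federal agents on the 9th.",
    "arrest(arg0=federal agents, arg1=the governor)",
    "analyst@example.com",
)
```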
In step (1) above, an event can be regarded as consisting of a single predicate together with the arguments it requires and the roles associated with the given instance. In short, an event consists of a predicate and its arguments. Events play an important role in text understanding. For example, the sentence "Illinois Governor Blagojevich was arrested by federal agents on the 9th" contains the event shown in Fig. 1.
This event describes an "arrest" action: the initiator of the action is "federal agents", the object of the action is "Illinois Governor Blagojevich", and the time of the action is "the 9th". An event is thus the formalized result of the semantics a sentence contains, and through it the information expressed by the text can be understood accurately.
Since an event consists of a predicate and arguments, two questions arise: which constituents of a text can serve as predicates, and how many arguments does each predicate have? In general, a predicate is a verb or verb phrase. For a given predicate, we also need to find out which arguments are connected to it, that is, which semantic role each argument plays in the action or state the predicate denotes. For this we refer to the semantic role classification scheme of PropBank, which defines the structure of each predicate and its arguments and marks the semantic roles the arguments may take. The core semantic roles of predicates in PropBank are shown in Fig. 2.
In addition, PropBank defines a rich set of non-core semantic roles, including beneficiary (argM-BNE), condition (argM-CND), direction (argM-DIR), degree (argM-DGR), location (argM-LOC), manner (argM-MNR), purpose (argM-PRP), time (argM-TMP), and so on. As PropBank's definitions show, semantic roles are closely tied to events: each semantic role describes one aspect of an event. Moreover, for a concrete event, the number of its arguments equals the number of semantic roles present in the event. Thus in "Illinois Governor Blagojevich was arrested by federal agents on the 9th", the "arrest" event has three arguments: agent, patient and time. These semantic roles, as event-related concepts, are linked to the event, and the link is defined by their respective semantic role categories.
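As a concrete picture of the structure just described, the "arrest" event of the example sentence can be written as a predicate with role-labeled arguments. The dictionary layout and the use of `arg0`/`arg1` for agent and patient are assumptions for illustration (Fig. 2 is not reproduced in this text); `argM-TMP` is the time role named above.

```python
# Hypothetical structured form of the "arrest" event from the example sentence.
arrest_event = {
    "predicate": "arrest",
    "arguments": {
        "arg0": "federal agents",         # agent: who performs the action
        "arg1": "the Illinois governor",  # patient: who the action is done to
        "argM-TMP": "the 9th",            # non-core role: time of the action
    },
}


def argument_count(event: dict) -> int:
    """The number of arguments equals the number of semantic roles present."""
    return len(event["arguments"])


n_arguments = argument_count(arrest_event)
```

Because the result is a structure rather than free text, a program can read off the agent, patient or time directly, which is what lets event search results be processed further by a computer.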
Semantic role labeling uses a maximum entropy model as the classifier, and the labeling process is divided into two stages.
The first stage identifies predicates, i.e. determines which words in a sentence are predicates. Let $C = \{c_1, c_2, \ldots, c_{n_l}\}$ be the set of senses, where $n_l$ is the number of classes, and let $t_i$ be the $i$-th sense of word $w$ in sentence $s$. The maximum entropy model selects the value of $t$ that maximizes the conditional probability $p(w \mid s, t_i)$:
$$\hat{t} = \arg\max_{t_i \in C} p(w \mid s, t_i)$$ (formula one)
To identify predicates, the following features are used: the word itself, its dependency type, its parent, the parent's part of speech, the set of children, the set of the children's parts of speech, the set of the children's dependency types, the left neighbor, the right neighbor, the left neighbor's dependency type, and the right neighbor's dependency type.
The second stage classifies the semantic class of each predicate. Classification again uses maximum entropy as the classifier, with the same features as in predicate identification. The semantic classes of predicates follow the standard in PropBank.
Semantic role labeling is similar to predicate identification. Note that, unlike common practice, this method merges argument identification and argument classification into a single process. The reason is that if the two are separated, a word that is misidentified in the first stage (argument identification) never reaches the second stage (argument classification), which lowers labeling accuracy. In addition to the features used in predicate identification, semantic role labeling uses the following features: position (to the left or right of the predicate, or the predicate itself); the predicate sense class (as defined in PropBank); the leftmost and rightmost words; the leftmost and rightmost children; the part-of-speech path (all parts of speech from this word to a given predicate, ordered along the connecting path); the dependency-type path (all dependency types from this word to a given predicate, ordered along the connecting path); and the common-ancestor path (the path from this word to its common ancestor with a given predicate, comprising both the part-of-speech path and the dependency-type path).
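The predicate identification features listed above can be read off a dependency-parsed sentence. The sketch below assumes a hypothetical `Token` record with a head index and a dependency label; the labels `nsubj`, `dobj` and `root`, the part-of-speech tags, and the placeholder strings `<ROOT>` and `<NONE>` are illustrative choices, not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    word: str
    pos: str      # part of speech
    head: int     # index of the parent token, -1 for the root
    deprel: str   # dependency type linking this token to its parent


def predicate_features(tokens: List[Token], i: int) -> Dict[str, object]:
    """Collect the eleven listed features for the token at position i."""
    tok = tokens[i]
    parent = tokens[tok.head] if tok.head >= 0 else None
    children = [t for t in tokens if t.head == i]
    left = tokens[i - 1] if i > 0 else None
    right = tokens[i + 1] if i + 1 < len(tokens) else None
    return {
        "word": tok.word,
        "deprel": tok.deprel,
        "parent": parent.word if parent else "<ROOT>",
        "parent_pos": parent.pos if parent else "<ROOT>",
        "children": frozenset(c.word for c in children),
        "children_pos": frozenset(c.pos for c in children),
        "children_deprel": frozenset(c.deprel for c in children),
        "left": left.word if left else "<NONE>",
        "right": right.word if right else "<NONE>",
        "left_deprel": left.deprel if left else "<NONE>",
        "right_deprel": right.deprel if right else "<NONE>",
    }


# Toy parse of "agents arrested governor" (hand-built for illustration).
tokens = [
    Token("agents", "NNS", 1, "nsubj"),
    Token("arrested", "VBD", -1, "root"),
    Token("governor", "NN", 1, "dobj"),
]
feats = predicate_features(tokens, 1)
```

A feature dictionary of this kind would then be fed to the maximum entropy classifier; the classifier itself is not part of this sketch.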
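The position and path features can be sketched on top of the same kind of dependency tree. In this minimal example the tree is given as a list of head indices (-1 for the root); the function names, the `>` separator and the toy tags are assumptions for illustration. The path climbs from the word to its common ancestor with the predicate and then descends to the predicate.

```python
from typing import Dict, List


def chain_to_root(heads: List[int], i: int) -> List[int]:
    """Indices from node i up to the root, following parent links."""
    chain = [i]
    while heads[chain[-1]] >= 0:
        chain.append(heads[chain[-1]])
    return chain


def tree_path(heads: List[int], word: int, predicate: int) -> List[int]:
    """Nodes on the path word -> common ancestor -> predicate."""
    up = chain_to_root(heads, word)
    down = chain_to_root(heads, predicate)
    common = next(n for n in up if n in down)
    return up[: up.index(common) + 1] + down[: down.index(common)][::-1]


def role_features(pos: List[str], deprels: List[str], heads: List[int],
                  word: int, predicate: int) -> Dict[str, str]:
    """Position, part-of-speech path and dependency-type path features."""
    path = tree_path(heads, word, predicate)
    if word == predicate:
        position = "self"
    else:
        position = "left" if word < predicate else "right"
    return {
        "position": position,
        "pos_path": ">".join(pos[n] for n in path),
        "deprel_path": ">".join(deprels[n] for n in path),
    }


# Toy parse of "federal agents arrested governor" (hand-built for illustration).
pos = ["JJ", "NNS", "VBD", "NN"]
deprels = ["amod", "nsubj", "root", "dobj"]
heads = [1, 2, -1, 2]
left_feats = role_features(pos, deprels, heads, 0, 2)
right_feats = role_features(pos, deprels, heads, 3, 2)
```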
In step (2) above, the log-linear model has essentially the same form as the maximum entropy model. For a random event, suppose we have a set of samples and wish to build a statistical model of the event's distribution. To do so, we select a set of features such that the model agrees exactly with the sample distribution on those features, while being as uniform as possible (i.e. having maximum entropy), so that apart from these features the model has no other preference. The log-linear model requires much less computation than the maximum entropy model, is a more flexible computational strategy, and is a very important model in natural language processing.
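For reference, the usual conditional form shared by maximum entropy and log-linear models is $p(y \mid x) = \exp(\theta \cdot f(x, y)) / \sum_{y'} \exp(\theta \cdot f(x, y'))$. The sketch below computes it for a handful of candidate labels; the function name, labels and feature vectors are made up for illustration and are not taken from the source.

```python
import math
from typing import Dict, Sequence


def log_linear_probs(theta: Sequence[float],
                     feature_vectors: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Return p(y | x) for each candidate y, given its feature vector f(x, y)."""
    scores = {y: sum(t * v for t, v in zip(theta, f))
              for y, f in feature_vectors.items()}
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


features = {"predicate": [1.0, 0.0], "not_predicate": [0.0, 1.0]}
probs = log_linear_probs([1.0, 0.0], features)
uniform = log_linear_probs([0.0, 0.0], features)
```

With all weights at zero the distribution is uniform, which reflects the "no other preference" property described above: the model departs from uniformity only as far as the weighted features require.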
We therefore use the log-linear model to parse sentences and obtain the events they contain. In general, an input sentence or phrase s can have many possible relevance trees or feature structures. Let T be the set of relevance trees or feature structures and t a particular relevance tree or feature structure; we use an m-dimensional feature vector $f(s, t) \in \mathbb{R}^m$ to represent t in s, and let $\theta \in \mathbb{R}^m$ be an m-dimensional parameter vector. Then for the sentence or phrase s, the optimal relevance tree or feature structure is:
$$t^{*} = \arg\max_{t \in T} \theta \cdot f(s, t)$$ (formula two)
Intuitively, the optimal relevance tree or feature structure is the one that maximizes the inner product of these two vectors.
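Selecting the structure with the largest inner product can be shown on a toy candidate set. The names `tree_a` and `tree_b`, the weights and the feature vectors are hypothetical; in practice T is far too large to enumerate, which is why a dedicated decoder is needed.

```python
from typing import Dict, Sequence


def inner(theta: Sequence[float], f: Sequence[float]) -> float:
    """Inner product of the parameter vector and a feature vector."""
    return sum(a * b for a, b in zip(theta, f))


def best_structure(theta: Sequence[float],
                   candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate t whose feature vector f(s, t) maximizes theta . f(s, t)."""
    return max(candidates, key=lambda t: inner(theta, candidates[t]))


theta = [2.0, -1.0, 0.5]
candidates = {
    "tree_a": [1.0, 0.0, 1.0],  # score 2.0 + 0.0 + 0.5 = 2.5
    "tree_b": [0.0, 1.0, 2.0],  # score 0.0 - 1.0 + 1.0 = 0.0
}
best = best_structure(theta, candidates)
```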
The process of extracting events with this model is shown in Fig. 3. Finding the optimal solution is the core of event acquisition and consists mainly of two modules, training and decoding.
The task of the training module is to estimate the weights of the vector θ in (formula two). Let M be the number of iterations and RT the set of relevance trees of sentence $x_i$. The learning algorithm is given below. It has two parts, checking correctness and updating parameters: the first step detects the difference between the training data and the algorithm's output; the second step readjusts the parameters according to this difference.
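Input: training examples $(x_i, y_i)$, $i = 1, \ldots, N$;
Initialization: $\theta = 0$;
Output: $\theta$;
Algorithm:
for $n = 1, \ldots, M$; $i = 1, \ldots, N$
i) Calculate $z^{*} = \arg\max_{z \in RT(x_i)} \theta \cdot f(x_i, z)$
ii) if ($z^{*} \neq y_i$) then readjust $\theta$ according to the difference between $y_i$ and $z^{*}$
Output: $\theta$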
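The training loop can be sketched as a structured perceptron. Note the assumption: the update formula itself did not survive in this text, so the sketch uses the standard perceptron update, θ ← θ + f(x, y) − f(x, z*), which fits the "check correctness, then update parameters" description but is not confirmed as the patent's exact rule. The callables `candidates` and `features` and the toy data are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Features = Dict[str, float]


def score(theta: Features, feats: Features) -> float:
    """Inner product of sparse weights and sparse features."""
    return sum(theta.get(k, 0.0) * v for k, v in feats.items())


def train(examples: List[Tuple[Hashable, Hashable]],
          candidates: Callable[[Hashable], Iterable[Hashable]],
          features: Callable[[Hashable, Hashable], Features],
          iterations: int) -> Features:
    theta: Features = {}
    for _ in range(iterations):
        for x, y in examples:
            # i) decode: best-scoring candidate under the current weights
            z = max(candidates(x), key=lambda c: score(theta, features(x, c)))
            # ii) on a mistake, move weights toward the gold structure
            #     and away from the wrongly predicted one (assumed update rule)
            if z != y:
                for k, v in features(x, y).items():
                    theta[k] = theta.get(k, 0.0) + v
                for k, v in features(x, z).items():
                    theta[k] = theta.get(k, 0.0) - v
    return theta


# Toy problem: one sentence, two candidate trees, gold answer "t2".
theta = train(
    examples=[("s", "t2")],
    candidates=lambda x: ["t1", "t2"],
    features=lambda x, c: {c: 1.0},
    iterations=2,
)
predicted = max(["t1", "t2"], key=lambda c: score(theta, {c: 1.0}))
```

In the first pass both candidates score zero and the decoder returns `t1`, which triggers one update; in the second pass `t2` already wins and the weights are left unchanged.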
The decoding module uses Prim's algorithm to obtain the relevance tree. Let $s = s_1, s_2, \ldots, s_n$ be the input sentence, where $s_i$ is a word in s, and let $G = \langle V, E \rangle$ be an undirected graph built for s, where $V = \{s_i : 1 \le i \le n\}$ is the set of nodes, one node per word, and $E = \{(s_i, s_j) : 1 \le i, j \le n\}$ is the set of edges, with an edge between every pair of words. Under the log-linear model, the goal of parsing is to obtain the relevance tree $t^{*}$ with the maximum score. Each node in the tree t corresponds to a feature word, so the features can be decomposed over t, which gives the following formula
(formula three)
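We obtain the relevance tree with Prim's algorithm. The algorithm is as follows.
Input: $G = \langle V, E \rangle$
Initialize: $V_{new} = \{s_i\}$, where $s_i \in V$; $E_{new} = \{\}$
Repeat until $V_{new} = V$:
Compute the edge $(s_i, s_j)$ of maximum weight with $s_i \in V_{new}$ and $s_j \in V \setminus V_{new}$
Add $s_j$ to $V_{new}$, and $(s_i, s_j)$ to $E_{new}$
Output: $MST = \langle V_{new}, E_{new} \rangle$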
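A minimal sketch of the decoding step follows, using Prim's algorithm to grow a maximum-weight spanning tree over the words of a sentence. It assumes that each edge has a single numeric score (standing in for the model's score of that edge) supplied by a hypothetical `weight` function; the words and weights are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def prim_max_spanning_tree(nodes: Sequence[str],
                           weight: Callable[[str, str], float]) -> List[Tuple[str, str]]:
    """Return the edges of a maximum-weight spanning tree of the complete graph."""
    if not nodes:
        return []
    visited = [nodes[0]]            # V_new, seeded with an arbitrary node
    edges: List[Tuple[str, str]] = []  # E_new
    while len(visited) < len(nodes):
        # Best edge leaving the current tree: one end inside, one end outside.
        u, v = max(
            ((a, b) for a in visited for b in nodes if b not in visited),
            key=lambda e: weight(*e),
        )
        visited.append(v)
        edges.append((u, v))
    return edges


# Symmetric toy edge scores (hypothetical values).
scores = {
    frozenset(("agents", "arrested")): 5.0,
    frozenset(("arrested", "governor")): 4.0,
    frozenset(("agents", "governor")): 1.0,
}
tree = prim_max_spanning_tree(
    ["agents", "arrested", "governor"],
    lambda a, b: scores[frozenset((a, b))],
)
```

This version rescans all crossing edges on every step, which is quadratic per step; that is acceptable for sentence-length inputs, and a priority queue could replace the scan for longer ones.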
In step (2) above, the feature design uses the following features:
Binary features: for any edge <w1, w2>, the binary features comprise <w1, w2>, <w1, c2>, <c1, w2> and <c1, c2>, where c1 and c2 are the concept classes of w1 and w2, as defined for example in HowNet or a Chinese thesaurus;
Structural features: for any node w adjacent to <w1, w2>, <w, w1, w2> is a structural feature;
Contextual features: for a word w located anywhere between w1 and w2 in the sentence, or within a certain distance outside w1 and w2, <w, w1, w2> is a contextual feature.
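The three feature templates can be sketched as small generator functions. The concept lookup table (standing in for HowNet or a thesaurus), the class name `PERSON`, the window size and the example sentence are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def binary_features(w1: str, w2: str, concept: Dict[str, str]) -> List[Tuple[str, str]]:
    """<w1,w2>, <w1,c2>, <c1,w2>, <c1,c2>; a word without a class stands for itself."""
    c1, c2 = concept.get(w1, w1), concept.get(w2, w2)
    return [(w1, w2), (w1, c2), (c1, w2), (c1, c2)]


def structural_features(w1: str, w2: str, edges: List[Tuple[str, str]]) -> List[Triple]:
    """<w, w1, w2> for every node w joined by an edge to w1 or w2."""
    feats = []
    for a, b in edges:
        if {a, b} == {w1, w2}:
            continue  # skip the edge itself
        if a in (w1, w2):
            feats.append((b, w1, w2))
        elif b in (w1, w2):
            feats.append((a, w1, w2))
    return feats


def contextual_features(sentence: List[str], i: int, j: int, window: int = 1) -> List[Triple]:
    """<w, w1, w2> for words between positions i and j, or within `window` outside them."""
    lo, hi = min(i, j), max(i, j)
    start, end = max(0, lo - window), min(len(sentence), hi + window + 1)
    return [(sentence[k], sentence[i], sentence[j])
            for k in range(start, end) if k not in (i, j)]


sentence = ["federal", "agents", "arrested", "the", "governor"]
concept = {"agents": "PERSON", "governor": "PERSON"}  # hypothetical concept classes
edges = [("agents", "arrested"), ("arrested", "governor"), ("federal", "agents")]

bin_feats = binary_features("agents", "governor", concept)
struct_feats = structural_features("agents", "arrested", edges)
ctx_feats = contextual_features(sentence, 1, 4)
```

Backing off from words to concept classes is what would let an edge seen in training with one word generalize to another word of the same class.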

Claims (2)

1. A sensitive information monitoring method based on event search, in which the hardware used comprises an event recognition component and an event extraction component, characterized in that the method comprises the following steps:
(1) the event recognition component uses the semantic role classification scheme of PropBank to perform semantic role labeling on the monitored text, in order to describe events and their features; wherein the semantic role labeling uses a maximum entropy model as the classifier, and the labeling process is divided into two stages,
The first stage identifies predicates, i.e. determines which words in a sentence of the monitored text are predicates; let $C = \{c_1, c_2, \ldots, c_{n_l}\}$ be the set of senses, where $n_l$ is the number of classes, and let $d_i$ be the $i$-th sense of word $w$ in sentence $s$; the maximum entropy model selects the value of $d$ that maximizes the conditional probability $p(w \mid s, d_i)$: $\hat{d} = \arg\max_{d_i \in C} p(w \mid s, d_i)$
To identify predicates, the following features are used: the word itself, its dependency type, its parent, the parent's part of speech, the set of children, the set of the children's parts of speech, the set of the children's dependency types, the left neighbor, the right neighbor, the left neighbor's dependency type, and the right neighbor's dependency type;
The second stage classifies the semantic class of each predicate; classification again uses maximum entropy as the classifier, with the same features as in predicate identification; the semantic classes of predicates follow the standard in PropBank; semantic role labeling proceeds in the same way as predicate identification, with argument identification and argument classification merged into a single process; in addition to the features used in predicate identification, semantic role labeling uses the following features: A. position, i.e. to the left or right of the predicate, or the predicate itself; B. the predicate sense class as defined in PropBank; C. the leftmost and rightmost words; D. the leftmost and rightmost children; E. the part-of-speech path, i.e. all parts of speech from the word being labeled to a given predicate, ordered along the connecting path; F. the dependency-type path, i.e. all dependency types from the word being labeled to a given predicate, ordered along the connecting path; G. the common-ancestor path, i.e. the path from the word being labeled to its common ancestor with a given predicate, comprising both the part-of-speech path and the dependency-type path;
(2), after identification event, event extraction parts adopt log-linear model to carry out event extraction, comprise the key concept of formation event; In order to carry out syntactic analysis to the sentence of monitored text message, to obtain the event in this sentence; Wherein, the concrete mode adopting log-linear model to carry out event extraction is, if T is the set of relevance tree or feature structure composition, t is a concrete relevance tree or feature structure, proper vector f (s, t) the ∈ Rm using a m to tie up represents the t in s, separately establishes θ ∈ Rm to be the parameter vector that a m ties up, then for this sentence or phrase s, the relevance tree of its optimum or feature structure are: , optimum relevance tree or feature structure should be make these two vectorial inner products reach mxm.;
Finding the optimal solution is the core module for obtaining events; it consists of two modules, a training module and a decoding module;
The training module estimates the weights of the vector θ. Let M be the number of iterations and RT the set of relevance trees of sentence x_i. The training module comprises two parts, checking correctness and updating parameters: the first step detects the difference between the training data and the output of the algorithm; the second step readjusts the parameters according to this difference;
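The two training steps can be illustrated with the following Python sketch. The claim does not name the update rule; the additive perceptron-style update used here, the enumeration of candidate trees and all identifiers are assumptions made for the example.

```python
def dot(theta, features):
    """Inner product of the parameter vector and a feature vector."""
    return sum(w * x for w, x in zip(theta, features))


def train(examples, m, iterations):
    """Estimate theta from (candidates, gold) pairs.

    candidates maps a tree identifier to its m-dimensional feature vector;
    gold is the identifier of the correct tree for that sentence.
    """
    theta = [0.0] * m
    for _ in range(iterations):
        for candidates, gold in examples:
            # Step 1: check correctness by comparing the current output with the gold tree.
            predicted = max(candidates, key=lambda t: dot(theta, candidates[t]))
            if predicted != gold:
                # Step 2: readjust the parameters according to the difference.
                for k in range(m):
                    theta[k] += candidates[gold][k] - candidates[predicted][k]
    return theta


# Hypothetical one-sentence training set with two candidate trees.
examples = [({"a": [1, 0], "b": [0, 1]}, "b")]
theta = train(examples, m=2, iterations=3)
```

After the first update the gold tree already scores highest in this toy example, so later iterations leave θ unchanged.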
The decoding module uses Prim's algorithm to obtain the relevance tree. Let $s = s_1, s_2, \ldots, s_n$ be the input sentence, where $s_i$ denotes a word in sentence s, and let G = <V, E> be an undirected graph built for s, where V = {s_i : 1 ≤ i ≤ n} is the set of nodes, one node per word, and E = {(s_i, s_j) : 1 ≤ i, j ≤ n} is the set of edges, with an edge connecting every pair of words. According to the log-linear model, the goal of parsing is to obtain the relevance tree t* with the maximum value; each node in a tree t corresponds to a feature word, so the features can be decomposed over t;
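A minimal sketch of the decoding step follows, assuming each edge of the complete graph already carries a score (for example the inner product of θ with that edge's features). Prim's algorithm is usually stated for minimum spanning trees; here the scores are negated so that the heap yields the highest-scoring edge first. The edge scores and function names are hypothetical.

```python
import heapq


def prim_max_spanning_tree(n, weight):
    """Return the edges (i, j) of a maximum spanning tree over nodes 0..n-1.

    weight(i, j) is a symmetric edge score; the graph is complete and undirected.
    """
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    # Negate scores because heapq is a min-heap.
    heap = [(-weight(0, j), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    while heap and len(in_tree) < n:
        _, i, j = heapq.heappop(heap)
        if j in in_tree:
            continue  # stale entry: j was attached through a better edge
        in_tree.add(j)
        edges.append((i, j))
        for k in range(n):
            if k not in in_tree:
                heapq.heappush(heap, (-weight(j, k), j, k))
    return edges


# Hypothetical scores for a four-word sentence.
scores = {(0, 1): 5, (0, 2): 1, (0, 3): 2, (1, 2): 4, (1, 3): 1, (2, 3): 3}


def edge_score(i, j):
    return scores[(min(i, j), max(i, j))]


tree = prim_max_spanning_tree(4, edge_score)
```

Because the graph is undirected, the result fixes which words are linked but not the direction of each link.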
The following features are used in the feature design:
Binary features: for any edge <w1, w2>, the binary features comprise <w1, w2>, <w1, c2>, <c1, w2> and <c1, c2>, where c1 and c2 are the concept clusters of w1 and w2 or their definitions in a Chinese thesaurus;
Structural features: for any node w adjacent to <w1, w2>, <w, w1, w2> is a structural feature;
Contextual features: for a word w located anywhere between w1 and w2 in the sentence, or within a certain distance outside w1 and w2, <w, w1, w2> is a contextual feature;
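The three feature templates might be generated as in the following sketch. The cluster lookup table, the adjacency map, the window size and the example words are assumptions introduced for illustration; the claim does not fix how adjacency or distance is measured.

```python
def binary_features(w1, w2, cluster):
    """Word/cluster combinations for the edge <w1, w2>."""
    c1 = cluster.get(w1, w1)  # fall back to the word itself if it has no cluster
    c2 = cluster.get(w2, w2)
    return [(w1, w2), (w1, c2), (c1, w2), (c1, c2)]


def structural_features(w1, w2, neighbours):
    """One feature <w, w1, w2> for every node w adjacent to w1 or w2."""
    adjacent = (neighbours.get(w1, set()) | neighbours.get(w2, set())) - {w1, w2}
    return [(w, w1, w2) for w in sorted(adjacent)]


def contextual_features(sentence, i, j, window=1):
    """One feature per word between positions i and j or within `window` words outside them."""
    lo, hi = min(i, j), max(i, j)
    start = max(0, lo - window)
    end = min(len(sentence), hi + window + 1)
    return [
        (sentence[k], sentence[i], sentence[j])
        for k in range(start, end)
        if k not in (i, j)
    ]


# Hypothetical example data.
sentence = ["police", "arrested", "the", "suspect", "yesterday"]
cluster = {"arrested": "DETAIN", "suspect": "PERSON"}
neighbours = {"arrested": {"police", "suspect"}, "suspect": {"arrested", "the"}}

edge_binary = binary_features("arrested", "suspect", cluster)
edge_structural = structural_features("arrested", "suspect", neighbours)
edge_contextual = contextual_features(sentence, 1, 3)
```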
(3) Specified events and event-related concepts are automatically searched for in the monitored text information: the sensitive events extracted from the natural language query specified by the user are matched against the events extracted from the monitored text information; if the predicate, the arguments and the semantic role classes are all identical, the two are the same event, and the monitored text information is judged to be sensitive information;
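The exact matching test of step (3) can be sketched as follows, assuming an event is stored as a predicate together with a set of (semantic role, argument) pairs. This representation, the role labels and the example events are assumptions made for illustration.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Event(NamedTuple):
    predicate: str
    arguments: FrozenSet[Tuple[str, str]]  # (semantic role, argument) pairs


def is_same_event(a, b):
    """Two events are the same if predicate, arguments and roles all coincide."""
    return a.predicate == b.predicate and a.arguments == b.arguments


def is_sensitive(text_events, sensitive_events):
    """A text is sensitive if any extracted event equals any queried sensitive event."""
    return any(is_same_event(e, q) for e in text_events for q in sensitive_events)


# Hypothetical query event and events extracted from a monitored text.
query = Event("attack", frozenset({("A0", "group"), ("A1", "station")}))
extracted = [
    Event("visit", frozenset({("A0", "minister"), ("A1", "school")})),
    Event("attack", frozenset({("A0", "group"), ("A1", "station")})),
]
flagged = is_sensitive(extracted, [query])
```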
(4) After a text passage is judged to be sensitive information, early warning information is generated automatically and sent to the relevant users by means such as e-mail or SMS, so that countermeasures can be taken in time.
2. The sensitive information monitoring method based on event searching according to claim 1, characterized in that: in step (3), incomplete matching of events is allowed during event searching, and constraint conditions are set at the same time.
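The claim does not state what the constraint conditions are. Purely as an assumed example, the sketch below requires the predicate and a set of mandatory roles to match and accepts the event when at least a given fraction of the queried role-argument pairs is found; the dictionary layout, the role names and the threshold are all hypothetical.

```python
def partial_match(query, candidate, required_roles=("A0",), min_overlap=0.5):
    """Incomplete event match under two assumed constraints.

    Events are dicts with a "predicate" string and an "arguments" dict
    mapping semantic role -> argument.
    """
    if query["predicate"] != candidate["predicate"]:
        return False
    q_args = query["arguments"]
    c_args = candidate["arguments"]
    # Constraint 1: every required role present in the query must match exactly.
    for role in required_roles:
        if role in q_args and q_args[role] != c_args.get(role):
            return False
    if not q_args:
        return True
    # Constraint 2: enough of the queried role-argument pairs must be found.
    shared = sum(1 for role, arg in q_args.items() if c_args.get(role) == arg)
    return shared / len(q_args) >= min_overlap


# Hypothetical events: the candidate lacks the location argument of the query.
query_event = {"predicate": "attack",
               "arguments": {"A0": "group", "A1": "station", "AM-LOC": "city"}}
found_event = {"predicate": "attack",
               "arguments": {"A0": "group", "A1": "station"}}
matched = partial_match(query_event, found_event)
```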
CN201210170863.4A 2012-05-29 2012-05-29 A kind of sensitive information monitoring method based on event searching Active CN102693314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210170863.4A CN102693314B (en) 2012-05-29 2012-05-29 A kind of sensitive information monitoring method based on event searching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210170863.4A CN102693314B (en) 2012-05-29 2012-05-29 A kind of sensitive information monitoring method based on event searching

Publications (2)

Publication Number Publication Date
CN102693314A (en) 2012-09-26
CN102693314B (en) 2015-07-29

Family

ID=46858747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210170863.4A Active CN102693314B (en) 2012-05-29 2012-05-29 A kind of sensitive information monitoring method based on event searching

Country Status (1)

Country Link
CN (1) CN102693314B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572958B (en) * 2014-12-29 2018-10-02 中国科学院计算机网络信息中心 A kind of sensitive information monitoring method based on event extraction
CN105956740B (en) * 2016-04-19 2019-12-31 北京深度时代科技有限公司 Semantic risk calculation method based on text logical features
CN107451158B (en) * 2016-06-01 2021-01-19 中国科学院地理科学与资源研究所 Method for extracting semantic roles of traffic events in web text
CN107784008A (en) * 2016-08-29 2018-03-09 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN107818195B (en) * 2017-08-24 2021-04-06 宁波大学 3D printing filling path generation method based on association tree
CN109977228B (en) * 2019-03-21 2021-01-12 浙江大学 Information identification method for power grid equipment defect text

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908042A (en) * 2010-08-09 2010-12-08 中国科学院自动化研究所 Tagging method of bilingual combination semantic role
CN101937430A (en) * 2010-09-03 2011-01-05 清华大学 Method for extracting event sentence pattern from Chinese sentence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
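Semantic Role Labeling Based on Conditional Fields (基于条件场的语义角色标注); Yan Tingyi (颜廷义); China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》); 2011-12-31; 3-56 *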

Also Published As

Publication number Publication date
CN102693314A (en) 2012-09-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160722

Address after: 430223, No. 8, Wuhan international road, 78 Optics Valley Road, Jiangxia District, Hubei, China

Patentee after: WUHAN HUAAN SCIENCE AND TECHNOLOGY CO., LTD.

Address before: 17, building 430000, block A, Hubei bank building, No. 81 North Central Road, Wuchang District, Wuhan, Hubei

Patentee before: Dai Song