CN103854064B - Event occurrence risk prediction and early warning method targeted to specific zone - Google Patents

Info

Publication number
CN103854064B
Authority
CN
China
Legal status
Active
Application number
CN201210501874.6A
Other languages
Chinese (zh)
Other versions
CN103854064A (en
Inventor
杨风雷
黎建辉
Current Assignee
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS

Abstract

The invention discloses a method for predicting, and issuing early warnings about, the risk of an event occurring in a specific region. The method comprises the following steps: (1) filtering crawled web page information; (2) parsing the words that express location in the web page information to obtain place names, then processing the web page information against an established information ontology and assigning it to the matching region; (3) processing the web page information with a regression analysis model to determine the object type each web page relates to; (4) obtaining, from the region and related object type of each web page, the set of web page information for events concerning a set region and a set object, establishing event feature parameters and calculating their values periodically, and issuing an early warning for any event whose feature parameter value exceeds a set threshold; and (5) when an early warning for a set object event has been issued in some region, using matrix analysis and a regression prediction model to issue differentiated early warnings about the risk of the set event occurring in the target region. The invention improves the efficiency of risk early warning.

Description

An event occurrence risk prediction and early warning method for a specific region
Technical field
The invention belongs to the field of information technology. More particularly, it relates to a method that applies specific processing to information crawled from the Internet and, on that basis, predicts the risk of a particular event occurring in a specific region and issues an early warning. It is mainly intended for emergency response work on unconventional emergencies, such as food safety information monitoring and risk early warning.
Background
In recent years, food safety incidents involving toxic capsules, recycled cooking oil, clenbuterol, dyed steamed buns, plasticizers, poisonous cucumbers and the like have occurred again and again. They have had a very bad social impact and caused substantial economic losses. To avoid or minimize the harm caused by such incidents, event-based risk early warning technology has begun to attract considerable attention. Event-based risk early warning requires that information about these events be discovered in advance.
With the rapid development of the Internet, the number of Internet users keeps growing, and the Internet is increasingly becoming the main channel through which users publish, obtain and transmit information. Through the interaction among people and organizations, it has formed a virtual society that corresponds to and is associated with the real one. It has become the largest public data source in the world, and its scale keeps increasing. In this situation, it is both necessary and highly significant to exploit the characteristics of the Internet to establish a sound social information feedback network, to discover in advance the latent factors that may lead to a crisis, and to provide timely, accurate and comprehensive information for the emergency management of food safety incidents.
Using information on the Internet for risk early warning of food safety incidents requires some processing to obtain event-related information. Internet information must first be crawled; information related to food safety incidents can then be extracted and discovered; and an early warning can be issued once the event has developed to a certain extent. The key step in this process is the identification of event information. In theory this can be done with various supervised or unsupervised machine learning methods, but in view of practical information needs, accuracy and operability, workarounds are often adopted. For example, some research work sets up information categories (such as diseases) in advance, collects keywords for each category, classifies the collected web page information by keyword matching against these categories and keywords, and then monitors how the classified information, that is, the event, develops. Other research work identifies and judges event information through steps such as information relevance detection, named entity recognition, information retrieval using disease names and addresses, and visual presentation of results.
Judging from evaluation results, the above approaches still fall short in judging, identifying and warning of event information (accuracy, recall and other measures need further improvement). This is not surprising, given that these methods do not consider the influence of the various kinds of spam present in the information, that information extraction technology is not yet sufficiently accurate, and that treating the classified information obtained by keyword matching directly as information about a single event can group together items whose subjects are inconsistent.
Further, once an event information discovery method has extracted the events occurring in the relevant regions, it would be very valuable for regional risk monitoring and early warning to predict the risk that a particular event which has not yet occurred in a specific region will occur there, that is, to predict and warn whether such an event will occur and after how long it is likely to occur. A review of the literature found no such research.
Summary of the invention
To solve the above problems, the object of the invention is to provide a method that analyzes the content of web page information through specific steps and then predicts, and issues early warnings about, the risk of a particular event occurring in a specific region. The method draws on intelligent systems methods, and its steps are as follows.
1. Establishing the ontology
Based on the characteristics of food safety incidents and the needs of later information analysis, a food safety incident information ontology is established along dimensions such as object, region, result, associated party and time. It provides the basis for information filtering, information discovery and other processing of food safety incidents.
2. Information filtering
On the basis of the ontology established above, the crawled web page information is filtered. Filtering has two main parts: food safety information filtering and spam filtering. The former applies pattern matching to the title, content and other parts of the information to determine whether it is food safety information. The latter uses established detection models to filter out content spam and link spam pages, as well as irrelevant opinions, low-quality opinions and deceptive spam opinions in user-generated content. This ensures the quality of the information that enters the subsequent steps.
3. Region information discovery
On the basis of the region information ontology established above, the title, content and other parts of the crawled and filtered information are first parsed, for example to resolve place-name pronouns. The region related to the information is then discovered and determined by pattern matching and by a judgment method based on machine-learned judgment models.
4. Object information discovery
After word segmentation, dimensionality reduction and similar steps are applied to the title, content and other parts of the information, regression analysis is carried out for each object type (set in advance, for example vegetables) using regression analysis models established beforehand, to determine whether the web page information is related to the target object. The object types related to the information are found in this way. Combining the region information, the object type information and so on then allows the event to be determined accurately.
5. Trend tracking and event early warning
After information filtering, region information discovery and object information discovery, feature parameters that characterize the event are established, such as page count, page views and a composite index. The development trend of the event is tracked by calculating the feature parameter values periodically. Each current feature parameter value of the event is compared with its mean over a preceding period; if the difference is positive and its absolute value persistently exceeds a certain threshold, an event early warning is issued.
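The warning rule in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold is taken as k standard deviations of the preceding period (the description gives 3 standard deviations as an example elsewhere), "persistently" is read as "for the last `persist` observations", and the function and parameter names are hypothetical.

```python
from statistics import mean, stdev

def should_warn(history, recent, k=3.0, persist=2):
    """Return True when each of the last `persist` values in `recent`
    exceeds the mean of `history` by more than k sample standard deviations.

    history: feature values (e.g. daily page counts) from the preceding period
    recent:  the most recent feature values, oldest first
    k, persist: illustrative assumptions, not values fixed by the text
    """
    if len(history) < 2 or len(recent) < persist:
        return False  # not enough data to estimate a baseline or persistence
    baseline = mean(history)
    threshold = k * stdev(history)
    # The difference must be positive and larger than the threshold every time.
    return all(value - baseline > threshold for value in recent[-persist:])

page_counts = [10, 12, 11, 9, 10, 11]        # preceding period
print(should_warn(page_counts, [30, 35]))    # sustained jump
print(should_warn(page_counts, [30, 11]))    # jump that did not persist
```

With this data the first call reports a warning and the second does not, because only one of the two recent values stays above the baseline by more than the threshold.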
6. Event end determination
For an event that has triggered an early warning, each feature parameter value is calculated periodically and compared with its mean over the preceding period (counted from the warning day). If the difference is negative and its absolute value exceeds a certain threshold, the early warning for the event is ended.
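As a minimal sketch of the end-of-warning test, assuming a single feature parameter and a fixed numeric threshold (both the function name and the sample numbers are hypothetical):

```python
from statistics import mean

def should_end_warning(values_since_warning, current, threshold):
    """Return True when `current` lies below the mean of the values observed
    since the warning day by more than `threshold`.

    values_since_warning: feature values recorded from the warning day onward
    current:              the newly calculated feature value
    threshold:            illustrative assumption; the text only says "a certain threshold"
    """
    if not values_since_warning:
        return False  # no baseline yet, keep the warning active
    difference = current - mean(values_since_warning)
    return difference < 0 and abs(difference) > threshold

# Page counts since the warning day average 35; a drop to 12 ends the warning.
print(should_end_warning([30, 35, 40], 12, threshold=10))
```

In practice the same test would be run for every feature parameter of the event; how the per-parameter results are combined is not specified in the text.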
7. Prediction, early warning and presentation of event occurrence risk in the target region
Based on the regional distribution of particular events that have already occurred, matrix decomposition and logistic regression analysis are used to analyze and predict whether the particular event will occur in the target region (where it has not yet occurred), its possible time of occurrence and so on, and different risk warnings are issued according to the prediction results. The results of the early warning analysis are then presented to the relevant users to serve them.
8. Ontology supplementation and revision
Because the distribution of Internet information changes over time, and with a view to continually improving the performance of the method, the results of information filtering, region and object information discovery and the other steps are evaluated periodically. On that basis, deficiencies in the ontology such as omissions and errors are supplemented and corrected to improve the subsequent performance of the method.
To ensure that information filtering and information discovery are accurate and efficient, the invention establishes an ontology that fits the characteristics of food safety incident information, built mainly along the dimensions of object, result, region, time and associated party. For each instance of the region information ontology, supplementary tables are established for six dimensions: area code, postal code, abbreviation, landmarks, adjacent regions and location within the parent region.
To improve the accuracy of event information discovery, the invention filters the crawled Internet information before any subsequent processing; this filtering comprises food safety information filtering and spam filtering.
To improve the accuracy of identifying the region related to a web page, the invention first preprocesses the web page information, then parses words that may be place names to obtain unambiguous terms, and then judges by pattern matching, judgment models and similar means whether the information can be assigned to the target region, thereby determining the related region.
To improve the accuracy of this determination, the invention applies place-name pronoun parsing, relative position parsing, normalization of non-standard place names and similar processing to the preprocessed web page information, thereby solving the problem of low accuracy caused by non-standard place names, place-name pronouns, relative positions and the like.
In determining the region related to a web page, the invention applies, in order, a pattern matching method on the title, a pattern matching method on the body text, and a judgment method based on machine-learned judgment models. In the last of these, the related region is judged by an ensemble of region judgment models, which avoids inaccurate region judgments caused by identical names, by words with more than one sense (for example a common word that is also a place name) and the like.
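The title-then-body-then-model order described above can be illustrated with a small cascade. This is a simplified sketch, not the patented procedure: plain substring matching stands in for the pattern matching rules, `model` is a hypothetical callable standing in for the ensemble of region judgment models, and the disambiguation of homonyms is left entirely to that callable.

```python
import re

def find_region(title, body, region_names, model=None):
    """Return (region, stage) for a web page.

    Stage 1 looks for a known region name in the title, stage 2 in the body,
    and stage 3 falls back to `model(title, body)` if a model is supplied.
    region_names and model are illustrative assumptions.
    """
    if region_names:
        # Try longer names first so that a longer name wins over its prefix.
        ordered = sorted(region_names, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(name) for name in ordered))
        for stage, text in (("title", title), ("body", body)):
            hit = pattern.search(text)
            if hit:
                return hit.group(0), stage
    if model is not None:
        return model(title, body), "model"
    return None, "none"

names = ["Beijing", "Shanghai"]
print(find_region("Beijing recalls dairy products", "", names))
print(find_region("Dairy recall", "Inspectors in Shanghai found ...", names))
```

The first call is resolved at the title stage and the second at the body stage; pages that match neither reach the model fallback, or remain unassigned when no model is given.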
In object information discovery, the invention applies word segmentation, dimensionality reduction and similar steps to the title, content and other parts of the information, and then carries out regression analysis for each object type using regression analysis models established beforehand, to determine which object types the web page information is related to.
The invention periodically calculates the relation between each feature parameter value of an event and its mean over a preceding period, and issues a timely event early warning when the difference is positive and its absolute value persistently reaches a certain level (for example 3 standard deviations).
For an event that has triggered an early warning, the invention periodically calculates each feature parameter value and compares it with its mean over the preceding period (counted from the warning day); if the difference is negative and its absolute value exceeds a certain threshold, the early warning for the event is ended.
Based on the regional distribution of particular events that have already occurred, the invention uses matrix decomposition and logistic regression analysis to analyze and predict whether the particular event will occur in the target region (where it has not yet occurred), its possible time of occurrence and so on, and issues different risk warnings according to the prediction results.
Compared with the prior art, the invention has the following advantages:
By establishing a food safety incident information ontology and its supplementary tables, and on that basis processing the crawled Internet information with information filtering, region information discovery, object information discovery, event early warning, risk prediction and early warning, and related techniques, the invention ensures the accuracy and comprehensiveness of food safety incident information discovery and early warning and of the prediction and early warning of event occurrence risk in the target region, and thus ensures the efficiency of food safety risk early warning.
Brief description of the drawings
Fig. 1 is a flow chart of the event occurrence risk prediction and early warning method for a specific region;
Fig. 2 is a schematic diagram of the supplementary tables of the region information ontology;
Fig. 3 is a flow chart of the method for identifying the region related to web page information;
Fig. 4 is a schematic diagram of the method for judging the region related to web page information;
Fig. 5 is a schematic diagram of the machine-learning-model-based method for judging the region related to web page information;
Fig. 6 is a schematic diagram of the event early warning method;
Fig. 7 is a schematic diagram of the method for predicting and warning of event risk in the target region.
Detailed description of the embodiments
A specific embodiment of the invention is shown in Fig. 1; the individual steps are described below.
1. Establishing the ontology
Considering the characteristics of food safety incidents and the needs of later event information extraction, tracking and analysis, the food safety incident information ontology is built mainly along five dimensions: object, region, time, result and associated party. For example, the object, namely food, can be divided into categories such as primary products and processed products, and primary products can in turn be divided into categories such as vegetables and fruits, and so on. The result can be divided into categories such as contamination and poisoning, and contamination can in turn be divided into categories such as expired and over-limit, and so on. The region can be divided overall into five categories: Asia, Europe, Africa, America and Oceania. Each category can be subdivided further; Asia, for example, can be divided into six categories, East Asia, West Asia, South Asia, North Asia, Central Asia and Southeast Asia, and so on, until a category can be divided no further, at which point it is a bottom-level element (an instance). The other dimensions are built in a similar way. In addition, for each instance in the ontology, supplementary tables of synonyms, antonyms, aliases and so on are established. For the instances in the region information ontology, supplementary tables are also established for six dimensions (as shown in Fig. 2), for use in subsequent information processing: area code, postal code, abbreviation, landmarks (mountains, lakes, seas, rivers, islands, buildings), adjacent regions (neighboring regions of the same level to the east, south, west, north and so on) and location (relative to the parent level, for example central or southern).
2. Information filtering
For specific information sources, Internet crawling techniques (such as general crawling or restricted-scope crawling) are used to crawl the information in the source. A website may contain content unrelated to the predetermined topic, and may contain various kinds of spam. To improve the accuracy of event information discovery and early warning, the information is therefore filtered before subsequent processing. The filtering has two aspects: food safety information filtering and spam filtering.
Food safety information filtering judges whether the collected information is related to food safety. Two questions must be considered: the scope of the information and the filtering rules. As for the filtering rules, based on the established food safety incident information ontology, the object and result dimensions are considered first, and filtering is done by pattern matching on the names, attributes and so on of the ontology instances of these two dimensions in combination. The specific pattern matching methods include Boolean matching, frequency matching, matching on the distance between instance names, synonym and antonym matching of instance names, and alias matching of instance names. The specific choice of methods and the specific rules are determined by statistical analysis of the information (determined in advance and updated periodically). As for the scope of the information, the title and the content are the two dimensions mainly considered. Because the title and content may not match, the title is processed first: if title filtering shows that the information belongs to the food safety category, processing of that item is complete; otherwise a second judgment is made on the content.
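The title-first, content-second judgment can be sketched as follows. The sketch uses only the simplest of the listed methods, Boolean matching, and assumes two small hypothetical term sets drawn from the object and result dimensions; frequency, distance, synonym and alias matching are not shown.

```python
def mentions_both(text, object_terms, result_terms):
    """Boolean match: at least one object term AND one result term occur in text.
    The terms are assumed to be lowercase."""
    lowered = text.lower()
    return (any(term in lowered for term in object_terms)
            and any(term in lowered for term in result_terms))

def is_food_safety(title, content, object_terms, result_terms):
    """Judge the title first; fall back to the content only if the title fails."""
    if mentions_both(title, object_terms, result_terms):
        return True  # the title alone settles it, the content is not examined
    return mentions_both(content, object_terms, result_terms)

# Hypothetical instance names from the object and result dimensions.
OBJECTS = {"milk", "vegetable"}
RESULTS = {"poisoning", "contamination"}

print(is_food_safety("Milk contamination reported", "", OBJECTS, RESULTS))
print(is_food_safety("Milk prices rise", "Market report", OBJECTS, RESULTS))
```

The first item passes on its title; the second fails both stages because no result term appears anywhere.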
Web spam can be divided into two kinds: web spam pages and spam opinions in user-generated content. Web spam pages can be divided into content spam pages and link spam pages. Spam opinions can be classified, according to the size of their negative impact, into untrustworthy opinions, low-quality opinions and irrelevant opinions. Untrustworthy opinions are deceptive opinions: they may give a specific object, event or person excessively high ratings or praise that does not match reality, or excessively low ratings, abuse or attacks that do not match reality. Low-quality opinions are generally short. Their content may or may not be useful, but because they do not describe the specific topic or product in much detail, their value for opinion mining on that topic or product cannot be determined, so they are also regarded as spam opinions (from the computer's point of view). Irrelevant opinions are mainly advertisements or content unrelated to the topic.
For web spam pages on a website, and for low-quality and irrelevant opinions in user-generated content, the spam characteristics are relatively obvious. Detection models can therefore be built from a sample set labeled in advance by extracting features of the samples along the content, content distribution and link dimensions (before feature extraction, the web page information undergoes metadata extraction, body text extraction, word segmentation, sentence statistics, paragraph statistics, anchor text statistics, link statistics and similar processing).
For the content dimension, this method segments the extracted information into words, removes stop words, applies dimensionality reduction (for example by the document frequency method or the information gain method) and forms a content feature vector whose weights are term frequencies.
For the content distribution dimension, this method uses the title length (number of characters), number of paragraphs, number of sentences, paragraph length (mean), sentence length (mean), information length (number of characters), number of anchor texts, anchor text length (mean number of characters) and so on. When the model is built, the features are normalized as y = x/(max+1), where x and y are the feature values before and after normalization and max is the maximum of the feature obtained in advance from statistics over the samples in the site information set; if x > max occurs before the max parameter is updated, x = max+1 is taken, that is, y = 1.
For the link dimension, this method uses the ratio of the information's in-site outbound links to its total outbound links, the ratio of its off-site outbound links to its total outbound links, the ratio of its links to items in the spam page set (built in advance) to its total outbound links, the ratio of the number of pages in the spam page set (built in advance) that link to the information to the total number of pages, and so on.
For the features of these three dimensions, feature vectors are formed from the spam set and non-spam set established in advance, and a machine learning method (for example a support vector machine) is used to build spam detection models (three models, updated periodically from the updated sample sets). Newly collected information can then be filtered; an item is judged to be spam when at least two of the models detect it as a positive example.
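Two small pieces of this paragraph translate directly into code: the y = x/(max+1) normalization with its clamping rule, and the rule that an item is spam when at least two of the three models flag it. The sketch below shows only these two pieces; the function names are hypothetical and the three detection models themselves are assumed to exist elsewhere.

```python
def normalize(x, max_seen):
    """Normalize a feature as y = x / (max + 1).

    If x exceeds the stored maximum (the max parameter has not been updated
    yet), x is replaced by max + 1, so that y = 1.
    """
    if x > max_seen:
        x = max_seen + 1
    return x / (max_seen + 1)

def is_spam(model_votes, min_positive=2):
    """Combine the per-model results: spam when at least `min_positive`
    of the models (content, content distribution, link) report a positive."""
    return sum(bool(vote) for vote in model_votes) >= min_positive

print(normalize(50, 99))              # 0.5
print(normalize(150, 99))             # 1.0, clamped because 150 > 99
print(is_spam([True, False, True]))   # True, two of three models agree
```

Dividing by max + 1 rather than max keeps every in-range value strictly below 1 and avoids division by zero when the stored maximum is 0.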
For deceptive spam opinions in a website's user-generated content, the spam characteristics are not obvious. In building the spam opinion sample set, the principle of preferring omission to error is followed (the accuracy of the deceptive spam opinion samples must be guaranteed), and information that may be deceptive spam opinion is reviewed and confirmed by knowledge-base-based review, investigation and similar means. This process focuses mainly on the following user-generated content, which forms the candidate deceptive spam opinion set: opinions whose content is duplicated or nearly duplicated; opinions posted by the top-n1 authors by number of opinions posted within a certain period; opinions about the top-n2 objects by number of opinions within a certain period; opinions from the top-n3 IP addresses by number of opinions posted within a certain period; opinions posted by the top-n4 earliest users to comment on a specific object; and opinions posted by the top-n5 users with the most opinion revisions for a specific object.
Two methods of confirmation are used: forward confirmation and reverse confirmation. In forward confirmation, an opinion is a deceptive spam opinion if its content describes the same matter as an item in the deceptive spam opinion knowledge base, that is, if its content matches the description of some item in the knowledge base. An entry is added to the knowledge base when, for an opinion, it is proved after some time or after the fact that the information posted by a user was indeed deceptive. For example, someone posts in a forum that a certain brand of milk contains melamine; someone later lists a variety of reasons why this is impossible; and it is afterwards proved that the latter was a deception by an employee of the milk company. That opinion can thus be confirmed as deceptive spam and added to the knowledge base (which is built in advance and updated periodically).
In reverse confirmation, the opinion is shown to be deceptive spam from the opposite angle, because such information could not exist under normal circumstances. An example rule in the reverse confirmation knowledge base (built in advance and updated periodically) is: if a user ID posts more than n (for example 10) opinions on one or more products within a set time (for example 1 minute), the opinions posted by this user are labeled as deceptive spam opinions. An example that matches this rule: in a forum, a user ID posts 15 reviews of 3 different products in less than 1 minute, which is impossible for a normal person. The deceptiveness of the information posted by this user is thus demonstrated from the reverse angle.
The information confirmed by the above methods is labeled and forms the accurate deceptive spam opinion set. Users who frequently post deceptive spam opinions, that is, the n users who post the most, are added to a blacklist for later identification. In addition, the atypical behavior of opinion authors (such as the above user posting 15 items about 3 products within 1 minute) is summarized from the accurate deceptive spam opinion set into rules for later use. Note that clearly confirming that an opinion is not deceptive spam is also quite difficult (being unable to show clearly that an item is deceptive spam may also mean being unable to state clearly that it is not). Considering time, workload and the diversity of non-deceptive opinions, non-deceptive spam opinions are not labeled here.
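The example reverse confirmation rule (more than n opinions from one user ID within a set time) amounts to a sliding-window count per user. The following sketch assumes that each post is available as a (user_id, timestamp in seconds) pair; the function name and the data layout are hypothetical.

```python
from collections import defaultdict

def flag_burst_posters(posts, window_seconds=60, max_posts=10):
    """Return the set of user IDs that posted more than `max_posts` opinions
    within any span of `window_seconds`.

    posts: iterable of (user_id, timestamp_in_seconds) pairs
    """
    times_by_user = defaultdict(list)
    for user_id, timestamp in posts:
        times_by_user[user_id].append(timestamp)

    flagged = set()
    for user_id, times in times_by_user.items():
        times.sort()
        start = 0  # index of the oldest post still inside the window
        for end, timestamp in enumerate(times):
            while timestamp - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_posts:
                flagged.add(user_id)
                break
    return flagged

# u1 posts 15 times in 42 seconds; u2 posts 15 times, two minutes apart.
posts = [("u1", t * 3) for t in range(15)] + [("u2", t * 120) for t in range(15)]
print(flag_burst_posters(posts))  # {'u1'}
```

All opinions by a flagged user would then be labeled as deceptive spam opinions, as the rule states; the defaults of 60 seconds and 10 posts mirror the examples given in the text.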
Once the accurate deceptive spam opinion set has been established, identifying deceptive spam opinions requires choosing a machine learning method and extracting sample features to build detection models. The processing above yields a labeled deceptive spam opinion set and an unlabeled opinion set, but no labeled non-deceptive opinion set. This means that ordinary supervised machine learning methods cannot simply be applied, because building their models requires both positive and negative example sets. A machine learning method that learns from positive and unlabeled data is therefore used here: biased SVM (Liu, B., Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. Proceedings of IEEE International Conference on Data Mining, 2003).
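For orientation, the biased SVM of the cited paper is commonly written as a soft-margin SVM in which the unlabeled examples are treated as negatives but their errors are penalized less heavily than errors on the labeled positives. The notation below (P for the labeled deceptive set, U for the unlabeled set, C_+ and C_- for the two penalty weights) is introduced here for illustration; the patent text does not state the formula or any parameter values.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2}
  + C_{+}\sum_{i\in P}\xi_{i}
  + C_{-}\sum_{i\in U}\xi_{i}
\qquad\text{subject to}\qquad
y_{i}\,(w^{\top}x_{i}+b)\;\ge\;1-\xi_{i},\quad \xi_{i}\ge 0,
```

Here y_i = +1 for i in P and y_i = -1 for i in U. Choosing C_+ larger than C_- reflects that the labeled positives are reliable, while the unlabeled set may still contain unrecognized deceptive opinions.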
For the sample features used in building the detection model, the present invention considers four dimensions: opinion author, opinion content, opinion content distribution, and link features. (Before features are extracted, the web page information must undergo metadata extraction (author and so on), body text extraction, word segmentation, part-of-speech tagging, named entity extraction, sentence statistics, paragraph statistics, punctuation statistics, link statistics, and similar processing.)
Opinion content features are determined as follows: the extracted opinion information is segmented into words, stop words are removed, and after dimensionality reduction (for example by the document frequency method or the information gain method) a content feature vector is formed, with term frequencies as weights.
Opinion content distribution features are: number of paragraphs, paragraph length (mean), number of sentences, sentence length (mean), number of words, number of first-person pronouns, number of second-person pronouns, number of third-person pronouns, and so on.
Opinion author features are: user name (character count), posting time (interval from midnight of the same day), posting interval (relative to the previous post), word count, opinions per hour (up to this post), word-count change ratio (relative to the previous post), opinion-count change ratio (up to this post, relative to the previous hour), and so on.
Link features of the opinion information are: number of in-site inbound links, number of in-site outbound links, number of off-site inbound links, number of off-site outbound links, number of items in the accurate deceptive spam opinion set that the opinion information links to, number of items in that set that link to the opinion information, and so on.
For the content distribution, author, and link dimensions, features are normalized while the model is built: y = x/(max+1), where x and y are the feature values before and after normalization, and max is the maximum of that feature obtained in advance by sampling the site's information set. If x > max occurs before the max parameter is updated, x = max+1 is used, i.e. y = 1.
For the features of these four dimensions, feature vectors are formed from the accurate deceptive spam opinion set established in the steps above and from the unlabeled sample set (the other samples in the user-generated content page set), and a detection model is built for each dimension (four models, updated regularly based on the updated sample sets).
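The normalization rule above can be sketched as follows. This is a minimal illustration only; the function name `normalize` and the parameter name `max_seen` are hypothetical and do not appear in the source.

```python
def normalize(x: float, max_seen: float) -> float:
    """Scale a raw feature value using y = x / (max + 1).

    max_seen is the maximum obtained in advance by sampling the site's data.
    """
    if x > max_seen:
        # The stored maximum is stale: clamp x to max + 1, which gives y = 1.
        x = max_seen + 1
    return x / (max_seen + 1)
```

For example, with a stored maximum of 99, a raw value of 50 maps to 0.5, and any value above 99 maps to 1.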
Newly crawled user-generated content can then be filtered for deceptive spam opinions. First, blacklist identification is performed: information posted by a user on the blacklist is identified directly as a deceptive spam opinion. The remaining opinions are checked against the rules summarized in the preceding process by reverse confirmation (that is, showing that such information could not occur under normal circumstances, which proves from the opposite direction that it is deceptive spam); opinions found to be abnormal are identified as deceptive spam opinions. The opinions that still remain are identified with the deceptive spam opinion detection models built as described above: each of the four models judges the opinion information separately, and if at least three models judge it to be a positive example, the information is identified as a deceptive spam opinion.
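The three-stage filter described above might be organized as in the sketch below. The record layout (`opinion["user"]`), the callables `violates_rules` and `models`, and the function name are all assumptions made for illustration; the source does not specify how the rules or the four models are represented.

```python
from typing import Callable, Sequence

Opinion = dict  # hypothetical record, e.g. {"user": "...", "text": "..."}


def is_deceptive_spam(
    opinion: Opinion,
    blacklist: set,
    violates_rules: Callable[[Opinion], bool],
    models: Sequence[Callable[[Opinion], bool]],
    min_votes: int = 3,
) -> bool:
    # Stage 1: anything posted by a blacklisted user is treated as spam.
    if opinion["user"] in blacklist:
        return True
    # Stage 2: reverse confirmation against the pre-compiled rules.
    if violates_rules(opinion):
        return True
    # Stage 3: each per-dimension model votes; at least min_votes must be positive.
    votes = sum(1 for model in models if model(opinion))
    return votes >= min_votes
```

Ordering the cheap checks first means the four models only run on opinions that the blacklist and the rules could not decide.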
After these filtering steps, the information that enters subsequent processing (that is, non-spam information related to food safety) is of relatively high quality, which lays the foundation for accurate subsequent processing.
3. Region information discovery (as shown in Figure 3)
(1) Web page information preprocessing
For the crawled and filtered web page information, metadata such as title, source, author, publication time, and the location of the publishing website is extracted and saved; the body content of the web page is extracted and saved as well.
The extracted title and body content are segmented with a word segmenter based on statistics and dictionaries (including a place-name dictionary formed from the ontology established in step 1). For each word, feature parameters are recorded: its position relative to the start and end of the text formed by the title and body content, the sentence it belongs to, and its position relative to the start and end of that sentence. Words that may not be place names are then excluded by matching against a vocabulary. The vocabulary is compiled in advance and updated regularly; it contains words that are both personal names and place names, words that have another specific meaning but may also be place names, and so on. For example, Wuzhong is a city in Ningxia Hui Autonomous Region and can also be a personal name; Founder (Fangzheng) is a county in Heilongjiang Province and can also be the Founder company. Note that words carrying a specific suffix, such as Wuzhong City, are not excluded.
(2) Place-name pronoun resolution
The segmented title and body text may contain pronouns that denote places, such as "this province" and "this city". Because these pronouns do not literally state an exact geographic location, they need to be resolved.
1) To resolve a place-name pronoun, first establish a sliding window for pronoun resolution. The window length l is determined in advance (for example, by analyzing the distribution of the number of words between place-name pronouns and their antecedents).
2) Check whether a plausible place-name noun exists within the l words before the place-name pronoun (for example, "this province" may correspond to Liaoning; plausibility is judged by rules established in advance). If one exists, use the judgment model described below to decide whether a reference relation holds between the place-name noun and the place-name pronoun. If a reference relation holds, determine the place-name noun corresponding to the pronoun from that relation, and resolution ends (if the reference relation holds for several place-name nouns, choose the one nearest the pronoun). Otherwise go to step 3).
3) If no plausible place-name noun exists within l words, or the model judges that no reference relation holds, check whether a plausible place-name noun exists within the 2l words before the place-name pronoun (without leaving the sentence, delimited for example by full stops). If one exists, use the judgment model described below to decide whether a reference relation holds. If a reference relation holds, determine the place-name noun corresponding to the pronoun from that relation, and resolution ends (if the reference relation holds for several place-name nouns, choose the one nearest the pronoun). Otherwise go to step 4).
4) If no plausible place-name noun exists within 2l words, or the model judges that no reference relation holds, determine the place name the pronoun refers to from the information source or the website location obtained during metadata extraction.
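Steps 1) to 4) can be sketched as a two-pass backward scan. This is an illustrative outline under stated assumptions: `is_place_noun` stands in for the pre-established plausibility rules, `refers_to` stands in for the trained judgment model, and the sentence boundary is applied to both passes for simplicity (the source states it explicitly only for the 2l pass). All names are hypothetical.

```python
from typing import Callable, List, Optional


def resolve_place_pronoun(
    tokens: List[str],
    pronoun_idx: int,
    sentence_start: int,
    window: int,
    is_place_noun: Callable[[str], bool],
    refers_to: Callable[[int, int], bool],
    fallback_place: Optional[str] = None,
) -> Optional[str]:
    # Pass 1 looks back l words, pass 2 looks back 2l words.
    for span in (window, 2 * window):
        lowest = max(sentence_start, pronoun_idx - span)
        # Scan from the nearest candidate to the farthest, so the first
        # accepted candidate is the one closest to the pronoun.
        for i in range(pronoun_idx - 1, lowest - 1, -1):
            if is_place_noun(tokens[i]) and refers_to(i, pronoun_idx):
                return tokens[i]
    # Step 4: fall back to the source or website location from the metadata.
    return fallback_place
```

Scanning backward from the pronoun implements the "nearest place-name noun wins" rule without a separate tie-breaking step.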
The judgment model is built as follows. Web page information containing place-name pronouns is collected to form a sample set. For each place-name pronoun in the sample information, the reference relation between it and each place-name noun within the preceding 2l words (l being the same length as in step 1), without leaving the sentence) is labeled and used as the class variable. For each such pair of place-name pronoun and place-name noun, relevant data is extracted to form the sample's feature vector for that pair, which includes: the suffix length of the place-name noun (the suffix denotes a place name or a place-name characteristic, such as "Autonomous Region" in "Xinjiang Uygur Autonomous Region"; suffix character count divided by text length), the distance between the place-name noun and the place-name pronoun, the distance of the place-name noun from the start of the text, the distance of the place-name pronoun from the start of the text, the distance of the place-name noun from the start of the sentence, the distance of the place-name pronoun from the start of the sentence, the distance of the place-name noun from the end of the sentence, the distance of the place-name pronoun from the end of the sentence (each distance being a word count divided by text length), and so on. A machine learning method (such as SVM) is then selected to build, from the sample set, class variable, and feature vectors, the judgment model that decides whether a reference relation holds between a place-name noun and a place-name pronoun.
The judgment model is used to decide whether a reference relation holds between a place-name pronoun and a place-name noun as follows. First, the data describing the relationship between the place-name noun and the place-name pronoun is extracted to form a feature vector. The extracted data comprises: the suffix length of the place-name noun (suffix character count divided by text length), the distance between the place-name noun and the place-name pronoun, the distance of the place-name noun from the start of the text, the distance of the place-name pronoun from the start of the text, the distance of the place-name noun from the start of the sentence, the distance of the place-name pronoun from the start of the sentence, the distance of the place-name noun from the end of the sentence, the distance of the place-name pronoun from the end of the sentence (each distance being a word count divided by text length), and so on. The judgment model built above then makes the decision, and the result determines whether a reference relation holds between the place-name pronoun and the place-name noun.
(3) Non-standard word resolution
Some words denoting places in the segmented title and body text may use non-standard forms; for example, "beijing" or "bj" may appear in Chinese text. These are resolved with a lookup table of standard and non-standard words (built in advance and updated regularly): the non-standard place-name form is looked up and replaced with the standard word.
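A lookup-and-replace pass of this kind can be sketched in a few lines. The table entries below are hypothetical examples modeled on the "beijing" and "bj" forms mentioned above; the actual table contents and its case handling are not given in the source.

```python
# Hypothetical lookup table: non-standard form -> standard place-name word.
STANDARD_FORMS = {"beijing": "北京", "bj": "北京"}


def standardize(tokens, table=STANDARD_FORMS):
    """Replace each non-standard place-name token with its standard form."""
    # Lower-casing before lookup is an assumption, so that "BJ" matches "bj".
    return [table.get(tok.lower(), tok) for tok in tokens]
```

Tokens that are not in the table pass through unchanged, so the pass is safe to run on every segmented text.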
(4) Relative position resolution
Some words denoting places in the segmented title and body text may express a place by relative position, such as "the southwestern provinces of China". Such expressions likewise contain no explicit place name. To solve this, the relative-position region information is queried and resolved using the region information ontology instances established in step 1 and their supplementary tables, yielding exact place-name words. (For example, for "the southwestern provinces of China", first find in the established region information ontology the provinces that belong to China, then query each province's supplementary table for its location-direction dimension, extract all provinces whose direction is southwest, and substitute them for "the southwestern provinces of China", which completes the resolution.)
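The query described above amounts to filtering ontology instances by parent region and direction. In the sketch below, the dictionary `REGIONS` is a hypothetical stand-in for the ontology instances and their supplementary tables; the source does not define their storage format.

```python
# Hypothetical fragment: region -> (parent region, direction within the parent).
REGIONS = {
    "Sichuan": ("China", "southwest"),
    "Yunnan": ("China", "southwest"),
    "Liaoning": ("China", "northeast"),
}


def resolve_relative(parent, direction, regions=REGIONS):
    """Return the regions under `parent` whose recorded direction matches."""
    return sorted(
        name
        for name, (region_parent, region_dir) in regions.items()
        if region_parent == parent and region_dir == direction
    )
```

With this fragment, `resolve_relative("China", "southwest")` yields the place names that would replace the relative expression in the text.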
(5) Region determination
After the web page information has been preprocessed and resolved, the region associated with the information can be determined. This mainly involves two steps: pattern matching, then a machine learning judgment model, each used to judge the region the information relates to (as shown in Figure 4).
The aim of region determination is to identify the region to which information relates, providing the regional basis for discovering food safety event information. Considering accuracy, computational cost, and operability, pattern matching is applied first. Two issues must be considered: the scope of information and the matching rules. The matching rules are based on the established region information ontology (the region dimension of the ontology) and mainly consider the names and attributes of some ontology instances; specifically, these names and attributes are combined and judged by pattern matching. The pattern matching methods include Boolean matching, frequency matching, and matching on the distance between instance names; the specific method and rules are determined by statistical analysis of the information (determined in advance and updated regularly). For the scope of information, two dimensions are considered: the title and the content. Because the title and the content may not agree, the title is processed first. If, after pattern matching on the title, the information can be assigned to the currently selected region (such as Beijing), pattern matching for that region is finished; otherwise a second round of pattern matching for that region is applied to the content of the information. The process follows the principle that it is better to miss an item than to assign it wrongly, so as to keep the identification results as accurate as possible.
If pattern matching cannot assign the information to any region, a third judgment is made with region judgment models built by machine learning. The models are built in advance as follows. The starting point is a web page information sample set that has been processed (as in steps (1)-(4)) and labeled (whether each sample is associated with a given region), built in advance and updated regularly. The title and content words of each sample (those matching ontology instance names and attributes) are combined and sorted into five categories: administrative place names (provinces, cities, and so on), area codes, postal codes, abbreviations, and scenic spots (mountains, lakes, seas, rivers, islands, buildings, and so on), forming five feature vectors. Term weights in the vectors are term frequencies; given the importance of title words, their weights are multiplied by a predetermined factor. A machine learning method (support vector machine, etc.) is then used to build region judgment models for each target region from these five feature vectors (five models, updated regularly based on the updated sample set).
The third judgment proceeds as follows. The title and content words (those matching ontology instance names and attributes) of the information processed and resolved in steps (1)-(4) are combined and sorted into the same five categories to form five vectors, with the same weighting. Each vector is judged by the corresponding one of the five region judgment models, and the results are combined by a weighted calculation (each category's weight is the sum of word frequencies in that category in the web page information divided by the sum of word frequencies over all five categories). If the weighted result exceeds a preset threshold, the information is assigned to the region; otherwise it is not (as shown in Figure 5).
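The weighted combination of the five model outputs can be sketched as below. The category keys and the encoding of each model's decision as 1 (assigned) or 0 (not assigned) are assumptions for illustration; the source only states how the weights are derived.

```python
from typing import Dict


def region_score(decisions: Dict[str, float], freq_sums: Dict[str, float]) -> float:
    """Combine per-category model decisions, weighting each category by its
    share of the total word frequency across all categories."""
    total = sum(freq_sums.values())
    if total == 0:
        # No matching words at all: nothing supports assigning the region.
        return 0.0
    return sum(decisions[cat] * freq_sums[cat] / total for cat in decisions)
```

The information would be assigned to the region when the returned score exceeds the preset threshold. A category with no matching words gets weight 0 and cannot influence the result.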
4. Object information discovery
Object information discovery for web page information is the identification of object type, that is, determining which kind of object the content of the web page concerns (and which event factors it involves, what consequences resulted, and so on). Its purpose is to identify an event as uniquely as possible by combining the region information, object information, and other information found in the web page.
To this end, considering identification accuracy, computational cost, and operability, regression analysis is used. The scope of information is the title and content of each web page combined; after word segmentation, stop-word removal, and dimensionality reduction, they form the web page's feature vector (the independent variables). Term weights are term frequencies; given the importance of title words, their weights are multiplied by a predetermined factor. Likewise, the weights of terms that match the names and attributes of object, result, and associated-party instances in the ontology are multiplied by a predetermined factor. For each object type, the feature vector of the web page is substituted into the corresponding logistic regression model (built in advance, for the types that need to be distinguished, from the sample set), and the regression result determines whether the web page information is related to that object type.
The regression analysis model is built as follows. Based on a collated and labeled web page information sample set (built in advance and updated regularly), the title and content words of each sample are combined, segmented, stripped of stop words, and reduced in dimension to form a feature vector (the independent variables). Term weights are term frequencies; given the importance of title words, their weights are multiplied by a predetermined factor, and the weights of terms that match the names and attributes of object, result, and associated-party instances in the ontology are likewise multiplied by a predetermined factor. The object type of each web page is labeled at the same time (1 means it belongs to the object type, 0 means it does not; this is the dependent variable). On this basis, a regression analysis model is built for each object type by the logistic method.
5. Trend tracking and event early warning
In practice, combining the region information and object type information found in the preceding steps is enough to determine an event fairly accurately (that is, the intersection of the information belonging to both dimensions represents the information related to the event).
Once the region and object type of the web page information have been identified, feature parameters representing the event are established. Specifically, the event is characterized by the number of pages related to the event, the number of page views, the number of page forwards, the number of page views on specific websites, the number of page views on websites under specific domain names, a composite index (a weighted combination of the above parameters; the weights are determined by the Delphi method and must sum to 1), and so on. The feature parameters are computed at regular intervals (for example every hour), and their changes over time are analyzed together.
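The composite index is a weighted sum with weights that sum to 1. A minimal sketch follows; the parameter names in the dictionaries are hypothetical, and whether the raw parameters are rescaled before weighting is not stated in the source, so no rescaling is applied here.

```python
def composite_index(params, weights):
    """Weighted combination of event feature parameters.

    weights would come from an expert procedure such as the Delphi method;
    the only constraint enforced here is that they sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * params[name] for name in weights)
```

For example, with weights of 0.5 on page count and 0.5 on page views, values of 10 and 30 give an index of 20.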
On the basis of this trend tracking, the value of each feature parameter of the event (including the composite index) is computed periodically (for example every 12 hours), and the current value of each feature parameter is compared with its mean over a preceding period (given how events propagate online, one month is currently chosen as the calculation period; this can be adjusted as needed). If the difference is positive and its absolute value exceeds a threshold (for example 3 standard deviations; the threshold is set in advance), early-warning initialization is performed for the event.
The event for which early-warning initialization has been performed is then tracked. The value of each feature parameter of the event (including the composite index) is computed periodically (for example every 12 hours), and the current value of each feature parameter is compared with its mean over a preceding period (given how events propagate online, the month before early-warning initialization is chosen as the calculation period; this can be adjusted as needed). If the difference stays above a threshold (for example 3 standard deviations; the threshold is set in advance) for a sustained period (for example 24 hours, determined in advance), a formal early warning is issued for the event (as shown in Figure 6). Otherwise the early-warning initialization for the event is cancelled.
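The two-stage warning logic (initialize, then confirm or cancel) can be sketched as a small state transition. The state names, function names, and the use of the population standard deviation are assumptions; the source gives "3 standard deviations" only as an example threshold and does not say which estimator is used.

```python
from statistics import mean, pstdev


def exceeds(current, history, n_sigma=3.0):
    """True if current is above the historical mean by more than n_sigma
    standard deviations (population standard deviation assumed)."""
    return current - mean(history) > n_sigma * pstdev(history)


def update_warning(state, readings, history, n_sigma=3.0):
    """state is 'none' or 'initialized'; readings are the values observed
    over the confirmation period, newest last."""
    if state == "none":
        return "initialized" if exceeds(readings[-1], history, n_sigma) else "none"
    # Initialized: every reading in the confirmation period must stay high.
    if all(exceeds(value, history, n_sigma) for value in readings):
        return "formal"
    return "none"  # Otherwise the initialization is cancelled.
```

With 12-hour readings and a 24-hour confirmation period, `readings` would hold the two most recent values when the initialized state is evaluated.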
The threshold is determined as follows. Historical data (for example from the past year) on the changes of each feature parameter of events is collected and combined with data on confirmed historical food safety events, such as time of occurrence, region, and scale (obtainable from food safety regulatory authorities). The differences between each feature parameter value of an event and its mean over the preceding period (for example one month) form the independent variables; a variable indicating whether a food safety event of the specified nature occurred (1 means it occurred, 0 means it did not) is the dependent variable; logistic regression analysis is used to build a regression prediction model between the independent and dependent variables. Based on this model, and taking into account the historical trend characteristics of the event feature parameters, a suitable value of the independent variable that makes the dependent variable equal 1 is selected as the threshold.
6. Event termination judgment
For an event under formal early warning, on the basis of the trend tracking above, the value of each feature parameter of the event (including the composite index) is computed periodically (for example every 12 hours), and the current value of each feature parameter is compared with its mean over a preceding period (given how events propagate online, the period from the day the early warning started to the day before the calculation day is currently chosen as the calculation period; this can be adjusted as needed). If the difference is negative and its absolute value exceeds a threshold (for example 3 standard deviations; the threshold is set in advance), the event is considered to have ended, and the early warning for the event is terminated.
7. Prediction, early warning, and display of event risk in the target region (as shown in Figure 7)
When a particular event has occurred in some regions, the probability that the event will occur in the target region (where it has not yet occurred) and its likely time of occurrence are computed periodically, and early warnings of different levels are issued according to the results. Before this calculation, the models (regularly updated) are built as follows:
Select the regions of the same administrative level (such as the provincial-level regions Hebei, Henan, etc.) as the target region (such as Beijing). For these regions (including the target region; let the total number be r), collect data on confirmed historical food safety events, such as time of occurrence, region, and scale (obtainable from food safety regulatory authorities), forming a data set that records which food safety events occurred where and when. On this basis, build a network graph according to whether each region experienced each particular event: the vertices of the graph are the regions and the food safety events; if a region experienced a particular event, an edge is created between the vertices of that region and that event, and the weight of the edge is the number of times this occurred. The network graph is then converted into an r*s matrix a (r is the number of regions, s is the number of food safety events), which is formed in advance and updated regularly.
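Building matrix a from the historical records is a counting step, sketched below. The record format (a list of `(region, event)` pairs) and the event labels are hypothetical; the region names follow the examples in the text.

```python
def build_matrix(records, regions, events):
    """Return an r*s matrix (list of rows) in which entry [i][j] counts how
    many times events[j] was recorded in regions[i]."""
    region_idx = {name: i for i, name in enumerate(regions)}
    event_idx = {name: j for j, name in enumerate(events)}
    a = [[0] * len(events) for _ in regions]
    for region, event in records:
        a[region_idx[region]][event_idx[event]] += 1
    return a
```

Each nonzero entry corresponds to one weighted edge of the region-event graph, so the matrix is simply the graph's biadjacency matrix.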
Meanwhile, according to the difference between the time the particular event occurred in the target region and the time the event first occurred in any region, n time ranges are set (for example 5 time periods: the event occurs in the target region within 1 day, 3 days, 1 week, 2 weeks, or 1 month of its earliest occurrence). The original data set is labeled for each time range (marking whether the particular event occurred in each region within that time range), forming n data sets (5 data sets when 5 time periods are set), which are formed in advance and updated regularly. On this basis, whether the particular event occurred in the target region within the time range is the dependent variable (1 means it occurred, 0 means it did not), whether the event occurred in each of the other regions is an independent variable (1 means it occurred, 0 means it did not), and logistic regression analysis is used to build regression prediction models between the independent and dependent variables (5 models, denoted c1, c2, c3, c4, c5; formed in advance and updated regularly).
On this basis, the probability that the particular event will occur in the target region and its likely time of occurrence are computed as follows:
According to the regions in which the particular event has currently occurred, update the corresponding elements of matrix a, then process matrix a by matrix decomposition to form a new matrix b. For example, with the SVD method, matrix a is first decomposed by singular value decomposition: a = t_y s_y d_y, where t_y is an r*f matrix, s_y is an f*f diagonal matrix, d_y is an f*s matrix, and f is the rank of matrix a. Set a positive integer k with 0 < k < f and keep only the k largest singular values in s_y: take the corresponding order-k diagonal matrix of s_y as s_m, the corresponding k columns of t_y as t_m, and the corresponding k rows of d_y as d_m. Then perform the inverse of the singular value decomposition, b = t_m s_m d_m, which completes the processing. Next, find the element of matrix b that indicates the association between the target region and the particular event. If it is greater than a preset threshold, it can be determined that the particular event may occur in the target region; otherwise it can be determined that the particular event probably will not occur in the target region.
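The truncated decomposition b = t_m s_m d_m can be sketched with NumPy. This is an illustration, not the patented implementation: the function name is hypothetical, and `numpy.linalg.svd` with `full_matrices=False` returns min(r, s) singular values rather than exactly rank-many, which does not change the result of keeping the k largest.

```python
import numpy as np


def low_rank_approx(a, k):
    """Rebuild a from its k largest singular values: b = t_m @ s_m @ d_m."""
    # Singular values come back sorted in descending order.
    t, s, d = np.linalg.svd(np.asarray(a, dtype=float), full_matrices=False)
    if not 0 < k <= len(s):
        raise ValueError("k must be between 1 and the number of singular values")
    return t[:, :k] @ np.diag(s[:k]) @ d[:k, :]
```

The entry of the returned matrix at the target region's row and the event's column is the value that would be compared with the preset threshold. Because the reconstruction smooths over the dominant co-occurrence patterns, that entry can be nonzero even where the original count was 0.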
If the above judgment determines that the particular event may occur in the target region, the values of the independent variables (1 means occurred, 0 means did not occur) are formed from the regions in which the event has currently occurred and substituted into the regression prediction models above, which are judged in the order c5, c4, c3, c2, c1. Specifically, if the result of c5 is true (the event will occur), c4 is judged next; if the result is false (the event will not occur, that is, it may occur only after 1 month), judging stops. This continues until a result is false or all models have been judged, which gives the time at which the event may occur in the target region: the time range represented by the last regression prediction model whose result was true. For example, if c2 is the last model whose result is true, the particular event can be predicted to occur in the target region after 1 day and within 3 days. In this way, early warnings of different time levels are issued for the risk that the particular event will occur in the target region.
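The cascade from c5 down to c1 can be sketched as a loop that stops at the first false result. The models are represented here as plain callables returning True or False, and the window labels are hypothetical strings; the source describes the models only as logistic regression predictors.

```python
def predict_window(models, labels, x):
    """models and labels are ordered from the widest window (c5) to the
    narrowest (c1). Returns the label of the last model that judged true,
    or None if even the widest window was judged false."""
    last_true = None
    for model, label in zip(models, labels):
        if not model(x):
            break  # Stop at the first false judgment.
        last_true = label
    return last_true
```

If c5 through c2 return true and c1 returns false, the result is the c2 label, matching the "after 1 day and within 3 days" example above.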
On the basis of the prediction and early-warning analysis of whether and when the particular event will occur in the target region, the results are shown to users as tables, charts, and so on, and instant delivery services such as SMS and e-mail are provided.
8. Ontology supplementation and revision
Throughout event information discovery and risk early warning, the food safety event information ontology that was built has an important effect on the performance of steps such as information filtering and information discovery. Therefore, given that the distribution of Internet information changes, and in order to keep improving the efficiency of the method, the results of information filtering, information discovery, and similar processes need to be evaluated periodically, and omissions, errors, and other deficiencies in the ontology supplemented and corrected, to improve the subsequent efficiency of the method.
This completes the whole process of extracting food safety event information from crawled Internet information and, according to how events evolve, issuing timely early warnings of event occurrence risk in the target region and serving users. Throughout, techniques such as information filtering, region information discovery, object type information discovery, trend tracking and early warning, and risk prediction and early warning ensure the accuracy of event information discovery, early warning, and risk prediction. This provides an important information basis for risk early warning and rapid emergency response to food safety events.
It is worth noting that the present invention is not limited to emergency management of food safety events; with slight modification, it can be applied to risk early warning and other emergency response work for other unconventional emergencies for which event information can be obtained from the Internet.

Claims (11)

1. An event occurrence risk prediction and early warning method for a specific region, comprising the steps of:
1) establishing a food safety event information ontology, and establishing a supplementary table for each instance in the ontology;
2) filtering the crawled web page information to obtain non-spam web page information related to food safety events;
3) resolving the words that denote places in the filtered web page information to obtain exact place-name words; processing the resolved web page information by pattern matching based on the names and attributes of the ontology instances of the region dimension in the food safety event information ontology, and assigning the web page information to the region that is successfully matched;
4) for each set object type, processing the web page information with a regression analysis model to determine the object type to which each web page relates;
5) according to the region and related object type of each web page determined in steps 3) and 4), obtaining the web page information set of the event for a set region and object, establishing feature parameters of the event and periodically computing the feature parameter values, and issuing an early warning for an event if its feature parameter value remains above a set threshold for a sustained period;
6) if an early warning for a set object event is issued in a certain region, periodically computing, based on matrix analysis and regression prediction models, the probability that the set event will occur in the target region and its likely time of occurrence, and issuing risk early warnings of different levels.
2. the method for claim 1 is it is characterised in that to the side that parsed of word representing place in info web Method is:
1) for ground nounoun pronoun, judge to whether there is between ground nounoun pronoun and its geographical term of above occurring with judgment models Refer to relation, if it is present ground nounoun pronoun is replaced with corresponding geographical term;
2) it is based on standard word and non-standard word synopsis parses to place name word non-standard in word, by non-standard words Language replaces with standard word;
3) based on the region dimension in described food safety affair Information Ontology, the relative position area information in word is carried out Parsing, obtains accurate place name word;
Wherein, the method for building up of described judgment models is: the info web of inclusively nounoun pronoun is formed a sample set, and right In sample set the relation that refers between nounoun pronoun and the geographical term before it be labeled, as class variable;Set up The characteristic vector of relation between ground nounoun pronoun and the geographical term before it: and then select machine learning method to be based on described sample Set, class variable and characteristic vector set up the judgment models between geographical term and ground nounoun pronoun with the presence or absence of the relation that refers to;
Wherein, judge between ground nounoun pronoun and its geographical term of above occurring with the presence or absence of the method for the relation that refers to be: calculate Between ground nounoun pronoun and geographical term, the characteristic vector value of relation, is sentenced to described characteristic vector value using described judgment models Disconnected, definitely the relation that refers between nounoun pronoun and geographical term whether there is.
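Step 2) of claim 2 (replacing non-standard place-name words by table lookup) can be illustrated with a minimal Python sketch. The table entries and the function name are hypothetical and are not taken from the patent; a real lookup table would be built from the attached tables of the ontology.

```python
# Hypothetical lookup table: non-standard place-name word -> standard word.
NONSTANDARD_TO_STANDARD = {
    "Peking": "Beijing",
    "Canton": "Guangzhou",
}


def normalize_place_names(words, table=NONSTANDARD_TO_STANDARD):
    """Replace each non-standard place-name word with its standard form.

    Words that do not appear in the table are returned unchanged.
    """
    return [table.get(word, word) for word in words]


normalized = normalize_place_names(["Peking", "milk", "Canton"])
```

A dictionary keeps each lookup at constant time, which matters when every segmented word of every crawled page is checked.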
3. The method of claim 1 or 2, characterized in that the food safety event information ontology comprises five dimensions: object, region, time, result and related party; the content of the attached table comprises three dimensions: synonyms, antonyms and aliases; wherein, for the region dimension, the content of the attached table further comprises six dimensions: area code, postal code, abbreviation, scenic spots, adjacent regions and orientation.
4. The method of claim 3, characterized in that, in step 3), before the words denoting locations in the filtered webpage information are parsed, a word segmenter is used to segment the information title and body text, and the following are recorded for each resulting word: its start and end positions relative to the text formed by the information title and body text, the sentence to which it belongs, and its positions relative to the start and end of that sentence.
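The position records described in claim 4 might look like the following Python sketch. It is only an illustration under stated assumptions: a regular expression stands in for the word segmenter (a real system for Chinese text would use a proper segmenter), sentences are assumed to end at `.`, `!`, `?` or a newline, and the names `Token` and `tokenize_with_positions` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    start: int            # offset of the first character in title + body
    end: int              # offset just past the last character
    sentence: int         # index of the sentence containing the word
    from_sent_start: int  # characters from sentence start to word start
    to_sent_end: int      # characters from word end to sentence end


def tokenize_with_positions(title, body):
    """Segment title and body and record each word's positions."""
    text = f"{title}\n{body}"
    tokens = []
    # Stand-in sentence splitter (assumption, see lead-in).
    for sent_idx, sent in enumerate(re.finditer(r"[^.!?\n]+[.!?]?", text)):
        # Stand-in word segmenter: runs of word characters.
        for m in re.finditer(r"\w+", sent.group()):
            start = sent.start() + m.start()
            end = sent.start() + m.end()
            tokens.append(Token(m.group(), start, end, sent_idx,
                                start - sent.start(), sent.end() - end))
    return tokens


tokens = tokenize_with_positions("Milk recall", "Found in Hebei. Spread later.")
```

These records supply the raw distances that the reference-relation features of claim 7 are computed from.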
5. The method of claim 4, characterized in that a table of doubtful place names is first established, recording place names that can also serve as other names; the words obtained by segmentation in step 3) are then matched against the table of doubtful place names, and the matching words are filtered out; wherein, if a matching word has a suffix denoting a place name, the word is retained.
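The filtering rule of claim 5 can be sketched in Python as follows. The table contents, the suffix list and the function names are hypothetical examples, and the sketch assumes that the segmenter returns a place name together with its suffix as a single word.

```python
# Hypothetical table of place names that are also used as other names.
DOUBTFUL_PLACE_NAMES = {"Jersey", "Reading"}
# Hypothetical suffixes that mark a word as a place name.
PLACE_SUFFIXES = ("Province", "City", "County")


def split_suffix(word):
    """Return (stem, suffix); suffix is '' if no place suffix is present."""
    for suffix in PLACE_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)].rstrip(), suffix
    return word, ""


def filter_doubtful(words):
    """Drop doubtful place names unless a place-name suffix confirms them."""
    kept = []
    for word in words:
        stem, suffix = split_suffix(word)
        if stem in DOUBTFUL_PLACE_NAMES and not suffix:
            continue  # ambiguous word with no confirming suffix
        kept.append(word)
    return kept


kept = filter_doubtful(["Jersey", "Jersey City", "Hebei", "Reading"])
```

Here "Jersey" and "Reading" are dropped as ambiguous, while "Jersey City" survives because its suffix marks it as a place.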
6. The method of claim 2, characterized in that a place-name pronoun denoting a location in the webpage information is parsed as follows:
61) setting a sliding window of length l for pronoun parsing;
62) checking whether a geographical term exists within the l words preceding the place-name pronoun; if so, applying the judgment model; if a reference relation exists, determining the geographical term corresponding to the pronoun according to the reference relation and ending the parsing; otherwise, proceeding to step 63);
63) checking whether a geographical term exists within the 2l words preceding the place-name pronoun; if so, applying the judgment model; if a reference relation exists, determining the geographical term corresponding to the pronoun according to the reference relation and ending the parsing; otherwise, proceeding to step 64);
64) determining the place name to which the place-name pronoun refers by extraction or substitution, according to the information source or website location obtained during metadata extraction.
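Steps 61) to 64) can be read as a widening search, sketched below in Python. The judgment model is passed in as a plain callable (`refers_to`), the geographical-term test as `is_place`, and the step 64) fallback as a ready-made value; all of these names are hypothetical. Candidates are scanned nearest-first, which matches the tie-breaking rule stated in claim 8.

```python
def resolve_pronoun(words, pronoun_idx, is_place, refers_to, window, fallback):
    """Resolve a place-name pronoun to a preceding geographical term.

    Tries the `window` words before the pronoun (step 62), then the
    2 * `window` words before it (step 63), and finally returns
    `fallback`, e.g. the site's location from metadata (step 64).
    """
    for span in (window, 2 * window):
        lowest = max(0, pronoun_idx - span)
        # Scan backwards so the nearest geographical term wins.
        for i in range(pronoun_idx - 1, lowest - 1, -1):
            if is_place(words[i]) and refers_to(words, i, pronoun_idx):
                return words[i]
    return fallback


words = ["Hebei", "found", "tainted", "milk", "and", "officials", "there", "acted"]
places = {"Hebei", "Shandong"}
found = resolve_pronoun(words, 6, places.__contains__,
                        lambda ws, i, j: True, window=3, fallback="Beijing")
missed = resolve_pronoun(words, 6, places.__contains__,
                         lambda ws, i, j: True, window=2, fallback="Beijing")
```

With `window=3` the term "Hebei" is only reached in the doubled window; with `window=2` neither window reaches it and the fallback is used.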
7. The method of claim 2 or 6, characterized in that the components of the sample feature vector in the judgment model comprise: the suffix length of the geographical term; the distance between the geographical term and the place-name pronoun; the relative distance of the geographical term from the start of the text; the relative distance of the place-name pronoun from the start of the text; the relative distance of the geographical term from the start of its sentence; the relative distance of the place-name pronoun from the start of its sentence; the relative distance of the geographical term from the end of its sentence; and the relative distance of the place-name pronoun from the end of its sentence.
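The eight components listed in claim 7 could be assembled as in the Python sketch below. The claim does not define "relative distance"; this sketch assumes it means a character offset divided by the length of the text or of the sentence, and the names `Mention` and `reference_features` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    start: int       # character offset of the word in the text
    end: int         # offset just past the word
    sent_start: int  # offset where the word's sentence starts
    sent_end: int    # offset where the word's sentence ends


def reference_features(term, pronoun, text_len, suffix_len):
    """Build the feature vector for a (geographical term, pronoun) pair.

    Relative distances are normalized by text or sentence length
    (an assumption, see lead-in).
    """
    term_sent = term.sent_end - term.sent_start
    pron_sent = pronoun.sent_end - pronoun.sent_start
    return [
        suffix_len,                                     # suffix length of the term
        pronoun.start - term.end,                       # term-to-pronoun distance
        term.start / text_len,                          # term from text start
        pronoun.start / text_len,                       # pronoun from text start
        (term.start - term.sent_start) / term_sent,     # term from sentence start
        (pronoun.start - pronoun.sent_start) / pron_sent,
        (term.sent_end - term.end) / term_sent,         # term from sentence end
        (pronoun.sent_end - pronoun.end) / pron_sent,
    ]


features = reference_features(
    Mention(start=0, end=5, sent_start=0, sent_end=20),
    Mention(start=30, end=35, sent_start=20, sent_end=40),
    text_len=40, suffix_len=1)
```

A vector of this shape, paired with the annotated class variable, is what a machine learning method would be trained on to obtain the judgment model.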
8. The method of claim 6, characterized in that, in step 62), if there are multiple geographical terms within the l words preceding the place-name pronoun for which the reference relation holds, the geographical term nearest to the place-name pronoun is selected; in step 64), if there are multiple geographical terms within the 2l words preceding the place-name pronoun for which the reference relation holds, the geographical term nearest to the place-name pronoun is selected.
9. the method for claim 1 is it is characterised in that combine the message header of each webpage and content, and Carry out participle, remove stop words, dimensionality reduction after form the characteristic vector of this webpage, using the characteristic vector of webpage as regression analysis model Independent variable webpage is processed, judge whether it related to object type.
10. the method for claim 1 is it is characterised in that step 5) method that carries out early warning to this event is: periodically counts Calculate the numerical value of described characteristic parameter, and the average in each characteristic ginseng value current for event and its regular period previous is compared Relatively, if difference is just and absolute value is more than certain threshold value it is determined that part carries out early warning initialization as to this;Pre- to having carried out The event of alert Initialize installation, continues periodically to calculate the numerical value of described characteristic parameter, and by each characteristic ginseng value current for event It is compared with the average in its regular period previous, if difference is just and absolute value is persistently more than certain threshold value, right This event carries out formal early warning;The characteristic parameter of described event includes: the information page number related to event, page browsing number, The page forwards the synthesis of website page browsing number and above-mentioned parameter under number, the page browsing number setting website, setting domain name to refer to Number.
11. The method of claim 1, 2 or 10, characterized in that step 6) is implemented as follows:
11) selecting the set of historical event information for regions of the same administrative level as the target region, and building an event network from this set of historical event information; wherein the vertices of the event network represent the regions and the food safety events; if a certain event has occurred in a region, an edge is created between the vertex representing that region and the vertex representing that event, and the weight of the edge is the number of times the event has occurred;
12) converting the event network into an r*s matrix a, where r is the number of regions and s is the number of food safety events;
13) based on the above set of historical event information, setting n time ranges according to the different lengths of time between the earliest occurrence of the set event and its occurrence in the target region, and annotating the set of historical event information for each time range, to form n data sets;
14) for each of the above data sets, taking whether the set event occurs in the target region within the corresponding time range as the dependent variable and whether the corresponding event has occurred in each of the remaining regions as the independent variables, and using regression analysis to establish a regression prediction model between the independent variables and the dependent variable for each data set;
15) updating the corresponding elements of matrix a, and processing matrix a by a matrix decomposition method to form a new matrix b;
16) finding the element of matrix b that associates the target region with the set event; if its value is greater than a preset threshold, determining that the set event may occur in the target region; otherwise, determining that the set event will not occur;
17) if it is determined that the set event will occur in the target region in the future, obtaining the values of the independent variables from the regions in which the set event has currently occurred, substituting them into the above regression prediction models, and obtaining from the result a predicted time at which the set event may occur in the target region;
18) according to the above risk prediction result, issuing warnings of different levels for the risk of the set event occurring in the target region.
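Steps 11), 12) and 16) of claim 11 can be illustrated with the Python sketch below. It covers only the construction of the weighted region-event matrix a and the threshold test on matrix b; the matrix decomposition of step 15) and the regression models of steps 13), 14) and 17) are not implemented here. The function names, the sample history and the values of b are hypothetical.

```python
from collections import Counter


def build_event_matrix(history):
    """Build the r*s region-event matrix from (region, event) occurrences.

    Each cell holds the edge weight of the event network, i.e. how many
    times the event occurred in the region.
    """
    weights = Counter(history)
    regions = sorted({region for region, _ in weights})
    events = sorted({event for _, event in weights})
    matrix = [[weights[(r, e)] for e in events] for r in regions]
    return regions, events, matrix


def may_occur(b, regions, events, region, event, threshold):
    """Step 16): compare the (region, event) element of b with a threshold."""
    return b[regions.index(region)][events.index(event)] > threshold


history = [("Hebei", "melamine"), ("Hebei", "melamine"),
           ("Shandong", "melamine"), ("Shandong", "dye")]
regions, events, a = build_event_matrix(history)

# Hypothetical matrix b, standing in for the result of the decomposition.
b = [[0.6, 1.9], [0.9, 1.1]]
at_risk = may_occur(b, regions, events, "Hebei", "dye", threshold=0.5)
```

In this example the entry of a for ("Hebei", "dye") is 0 because the event has not yet occurred there, while the corresponding entry of the hypothetical b is 0.6, which is the kind of filled-in value that step 16) compares against the threshold.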
CN201210501874.6A 2012-11-29 2012-11-29 Event occurrence risk prediction and early warning method targeted to specific zone Active CN103854064B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210501874.6A CN103854064B (en) 2012-11-29 2012-11-29 Event occurrence risk prediction and early warning method targeted to specific zone

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210501874.6A CN103854064B (en) 2012-11-29 2012-11-29 Event occurrence risk prediction and early warning method targeted to specific zone

Publications (2)

Publication Number Publication Date
CN103854064A CN103854064A (en) 2014-06-11
CN103854064B CN103854064B (en) 2017-01-25

Family

ID=50861694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210501874.6A Active CN103854064B (en) 2012-11-29 2012-11-29 Event occurrence risk prediction and early warning method targeted to specific zone

Country Status (1)

Country Link
CN (1) CN103854064B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105068989B (en) * 2015-07-23 2018-05-04 中国测绘科学研究院 Place name address extraction method and device
CN108320256A (en) * 2017-12-08 2018-07-24 中国电子科技集团公司电子科学研究院 Social security events recognition methods, equipment and storage medium based on big data
CN108959368A (en) * 2018-05-22 2018-12-07 深圳壹账通智能科技有限公司 A kind of information monitoring method, storage medium and server
CN108874642A (en) * 2018-05-25 2018-11-23 平安科技(深圳)有限公司 SQL method for monitoring performance, device, computer equipment and storage medium
CN109146223B (en) * 2018-06-14 2021-11-30 中国地质大学(武汉) Land utilization transformation management and control system
CN110544013B (en) * 2019-07-31 2024-03-05 平安科技(深圳)有限公司 Disaster risk early warning method and device, computer equipment and storage medium
CN110781204B (en) * 2019-09-09 2024-02-20 腾讯大地通途(北京)科技有限公司 Identification information determining method, device, equipment and storage medium of target object
CN113051315B (en) * 2021-03-26 2022-08-19 中国气象局公共气象服务中心(国家预警信息发布中心) Information quantity calculation system for emergency early warning information
CN113392582B (en) * 2021-06-03 2022-03-08 中国科学院国家空间科学中心 Similar recommendation method and system for space environment events of coronal mass ejection
CN113610356A (en) * 2021-07-16 2021-11-05 如东信息技术服务(上海)有限公司 Airport core risk prediction method and system
CN114066077B (en) * 2021-11-22 2022-09-13 哈尔滨工业大学 Environmental sanitation risk prediction method based on emergency event space warning sign analysis
CN114565196B (en) * 2022-04-28 2022-07-29 北京零点远景网络科技有限公司 Multi-event trend prejudging method, device, equipment and medium based on government affair hotline

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7602281B2 (en) * 2006-01-26 2009-10-13 The United States Of America As Represented By The Secretary Of The Army System and method for tactical distributed event warning notification for individual entities, and computer program product therefor

Also Published As

Publication number Publication date
CN103854064A (en) 2014-06-11

Similar Documents

Publication Publication Date Title
CN103854064B (en) Event occurrence risk prediction and early warning method targeted to specific zone
CN103854063B (en) A kind of prediction of event occurrence risk method for early warning based on internet opening imformation
CN103853700B (en) A kind of event method for early warning found based on region and object information
CN103176981B (en) A kind of event information excavates and the method for early warning
CN103176983B (en) A kind of event method for early warning based on internet information
CN103853738B (en) A kind of recognition methods of info web correlation region
Bozarth et al. Toward a better performance evaluation framework for fake news classification
CN106598944B (en) A kind of civil aviation's security public sentiment sentiment analysis method
Jin et al. News credibility evaluation on microblog with a hierarchical propagation model
CN110516067A (en) Public sentiment monitoring method, system and storage medium based on topic detection
Zhuang et al. An intelligent anti-phishing strategy model for phishing website detection
Brynielsson et al. Analysis of weak signals for detecting lone wolf terrorists
CN104899508B (en) A kind of multistage detection method for phishing site and system
US20140040301A1 (en) Real-time and adaptive data mining
CN104077396A (en) Method and device for detecting phishing website
CN103793503A (en) Opinion mining and classification method based on web texts
CN107609103A (en) It is a kind of based on push away spy event detecting method
CN103176984B (en) Duplicity rubbish suggestion detection method in a kind of user-generated content
CN110457404A (en) Social media account-classification method based on complex heterogeneous network
CN103544436A (en) System and method for distinguishing phishing websites
CN103092975A (en) Detection and filter method of network community garbage information based on topic consensus coverage rate
CN102170447A (en) Method for detecting phishing webpage based on nearest neighbour and similarity measurement
CN109492097B (en) Enterprise news data risk classification method
CN109145301A (en) Information classification approach and device, computer readable storage medium
CN106446124A (en) Website classification method based on network relation graph

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant