CN103854063B - Method for predicting and giving early warning of event occurrence risk based on open Internet information - Google Patents

Method for predicting and giving early warning of event occurrence risk based on open Internet information

Info

Publication number
CN103854063B
CN103854063B CN201210501872.7A CN201210501872A CN103854063B CN 103854063 B CN103854063 B CN 103854063B CN 201210501872 A CN201210501872 A CN 201210501872A CN 103854063 B CN103854063 B CN 103854063B
Authority
CN
China
Prior art keywords
information
sample
event
pronoun
info web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210501872.7A
Other languages
Chinese (zh)
Other versions
CN103854063A (en
Inventor
杨风雷
黎建辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Network Information Center of CAS
Priority to CN201210501872.7A
Publication of CN103854063A
Application granted
Publication of CN103854063B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for predicting and giving early warning of event occurrence risk based on open Internet information. The method comprises: 1) filtering spam from the web page information; 2) parsing the words that denote places in the filtered web page information to obtain place-name words, then processing the parsed web page information against a pre-built information ontology so that each page is assigned to its matching region; 3) filtering the web page information to keep the pages related to food safety, then processing the retained pages with a regression analysis model to determine the object types each page relates to; 4) determining the set of web pages for a set region and object event, establishing event feature parameters and computing their values periodically, and issuing an early warning for an event if one of its feature parameter values exceeds a set threshold; 5) issuing different early warnings for the risk that a set event occurs in a target region, based on matrix analysis and a regression prediction model. The invention improves the efficiency of risk early warning.

Description

Method for predicting and giving early warning of event occurrence risk based on open Internet information
Technical field
The invention belongs to the field of information technology. In particular, it relates to a method that applies specific processing to Internet information obtained by crawling and then predicts, and gives early warning of, the risk that a particular event occurs in a particular region. It is mainly intended for emergency response to unconventional emergencies, for example food safety information monitoring and risk early warning.
Background
In recent years, food safety incidents such as toxic capsules, recycled cooking oil, clenbuterol, dyed steamed buns, plasticizers and poisonous cucumbers have occurred again and again. They have had an extremely bad social impact and caused large economic losses. To avoid, or reduce as far as possible, the harm caused by such incidents, event-based risk early-warning technology has begun to attract considerable attention. Event-based risk early warning requires that information about these events be discovered in advance.
With the rapid development of the Internet, the number of Internet users keeps growing, and the Internet has become the main medium through which users publish, obtain and pass on information. Through the interaction of people, organizations and so on, it has formed a virtual society that corresponds to, and is linked with, the real one. It has become the largest public data source in the world, and its scale keeps increasing. In this situation it is both necessary and highly significant to use the characteristics of the Internet to build a sound social information feedback network, to discover in advance the emerging factors that may lead to a crisis, and to supply timely, accurate and comprehensive information for the emergency management of food safety incidents.
In practice, before most food safety incidents break out, scattered clues have already appeared on the Internet. Collecting and analysing the relevant Internet information and then issuing early warnings can therefore provide a direct information source for the emergency management of these incidents. To obtain and grasp the required Internet target information in a timely, accurate and comprehensive way, technologies such as Internet information analysis and early warning must be used.
For example, some research uses Internet information for risk early warning, but how the collected information is processed and which measures are taken both require human participation and decisions. Other research performs food safety risk early warning automatically from Internet information for additives, supplements and the like, but it has shortcomings. It does not consider information quality and does not filter the collected spam, which affects the accuracy of the warnings. In event information discovery, it treats the categorized information obtained by keyword matching directly as information about the same event, so the subjects that the items refer to may be inconsistent. Actual test results show that its information classification, and the accuracy and comprehensiveness of its warnings, still need further improvement.
Meanwhile, once an event information discovery method has extracted the events that occurred in the relevant regions, it would be of great value for regional risk monitoring and early warning to predict the risk that the particular event occurs in a specific region where it has not yet occurred, that is, whether such an event will occur there and after how long. A literature review found no such research.
Summary of the invention
To solve the above problems, the object of the invention is to provide a method that analyses the content of web page information in specific steps and then predicts, and gives early warning of, the risk that a particular event occurs in a particular region. The method draws on intelligent-systems methods; its steps are as follows.
1. Web page crawling
Internet crawler software (such as Heritrix or Nutch) is used to crawl the web pages of the information sources. During crawling, techniques such as limited-scope crawling and vertical crawling are used to obtain as many of the required web pages as possible, and the pages are saved.
2. Spam filtering
To improve information quality in the subsequent steps, the crawled web page information is filtered for spam. The filtering mainly targets spam based on content cheating and link cheating and, in user-generated content, irrelevant opinions, low-quality opinions and deceptive spam opinions, all of which are filtered by the detection models that have been established. This ensures the quality of the information entering the subsequent steps.
3. Region information discovery
On the basis of the spam filtering above, place-name words in the title, body and other parts of the crawled web pages are parsed, and the region each page relates to is then determined by pattern matching and by a machine-learning-based judgment model.
4. Regional event early warning
After food safety information filtering and object information discovery have been applied to the information, feature parameters that characterize a regional event are established, such as page count, page view count and a composite index. The trend of the event is tracked by computing the feature parameter values periodically, and each current value is compared with its mean over a preceding period. If the difference is positive and its absolute value stays above a certain threshold, a regional event early warning is issued.
5. Prediction and early warning of event occurrence risk in the target region
Based on the regional distribution of a particular event that has already occurred, matrix decomposition and logistic regression analysis are used to analyse and predict whether the event will occur in the target region and when it is likely to occur, and different risk warnings are issued according to the prediction.
6. Presentation of results and services
Once it has been predicted and analysed whether and when the particular event will occur in the target region, the results are presented to the user as tables, charts and so on, and push services such as immediate SMS and e-mail delivery are provided.
To improve the accuracy of event information discovery, the invention first filters spam from the crawled Internet information before any subsequent processing.
To ensure that the samples used to build the deceptive spam opinion detection model are representative, the invention first establishes, for each piece of opinion information, a partitioning feature vector based on content distribution. It partitions the opinion information by clustering and then draws the samples for model building from each partition by random sampling.
To build the deceptive spam opinion detection model, the invention extracts sample features as follows. An initial feature vector based on content and links is first built for each sample. The P samples most similar to a given sample are then found, and the final feature vector of that sample is obtained from the class labels of these P samples and their similarity values to the sample. Repeating this gives the final feature vector of every sample. The feature vector thus combines content, links and the classes of similar samples, which makes the extracted sample features comprehensive and complete.
When using the models to detect deceptive spam opinions, the invention sets weight coefficients based on the distance between the opinion information and each partition, combines the detection results of the partition models for that opinion, and obtains the final detection result as their weighted sum. This safeguards the accuracy of the detection result.
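The distance-weighted combination described above can be sketched as follows. This is a minimal illustration, not the patent's formula: the text only says that the weights are based on the distance to each partition, so inverse Euclidean distance to each partition's mean vector, normalized to sum to 1, is an assumption, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combine_partition_scores(features, centroids, scores, eps=1e-9):
    """Weighted sum of per-partition model scores.

    Assumption: the weight of a partition is the inverse of the distance
    between the opinion's feature vector and that partition's mean vector,
    normalized so that the weights sum to 1.
    """
    raw = [1.0 / (euclidean(features, c) + eps) for c in centroids]
    total = sum(raw)
    return sum((r / total) * s for r, s in zip(raw, scores))
```

An opinion that sits on one partition's mean vector is scored almost entirely by that partition's model, while an opinion equidistant from all partitions receives the plain average of the model outputs.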
To improve the accuracy of identifying the region a web page relates to, the invention first preprocesses the web page information, then parses the words that may be place names to obtain unambiguous words, and finally uses pattern matching and judgment-model decisions to determine whether the information can be assigned to the target region, thereby determining the region the page relates to.
In determining the region a web page relates to, the invention applies in turn a pattern matching method for the title, a pattern matching method for the body text, and a machine-learning-based judgment model. In the machine-learning-based method, an ensemble of region judgment models decides the related region, which avoids inaccurate region judgments caused by identical names or by words with other meanings (such as an ordinary word that is also used as a place name).
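The three-stage order (title matching, then body matching, then a learned model) can be illustrated with the sketch below. The gazetteer structure, the substring matching and the rule that only a single unambiguous match decides a stage are all assumptions made for illustration; `model_judge` is a hypothetical callback standing in for the ensemble judgment model.

```python
def regions_mentioned(text, gazetteer):
    """Return the regions whose place names occur in the text."""
    return sorted(
        region
        for region, names in gazetteer.items()
        if any(name in text for name in names)
    )

def assign_region(title, body, gazetteer, model_judge):
    """Try the title, then the body, then fall back to the model."""
    for text in (title, body):
        hits = regions_mentioned(text, gazetteer)
        if len(hits) == 1:  # assumption: only an unambiguous match decides
            return hits[0]
    # Ambiguous or missing matches are left to the learned judgment model.
    return model_judge(title, body)
```

A page whose title names exactly one region is assigned immediately; a page that names several regions, or none, is passed down the cascade.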
In object information discovery, the invention uses regression analysis models built in advance. After the title, body and other parts of the information have been segmented into words and reduced in dimension, a regression analysis is run for each object type to determine which object types the web page relates to.
The invention periodically computes the relation between each feature parameter value of an event and its mean over a preceding time window. When the difference is positive and its absolute value persistently reaches a certain level (such as 3 standard deviations), a timely event warning is issued.
For an event under warning, the invention periodically computes each feature parameter value and compares the current value with its mean over the preceding period (counted from the warning day). If the difference is negative and its absolute value exceeds a certain threshold, the warning for this event is ended.
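The start and end conditions for a warning can be sketched as below. This is an illustration under stated assumptions: "persistently" is read as "every value in a short run of recent observations", the 3-standard-deviation level comes from the example in the text, and the function names and the use of the population standard deviation are choices made here, not taken from the patent.

```python
from statistics import mean, pstdev

def should_start_warning(history, recent, k=3.0):
    """True if every recent value exceeds the baseline mean by more than k sigma.

    history: feature parameter values from the preceding period (baseline).
    recent:  the latest consecutive values (assumed reading of "persistently").
    """
    mu, sigma = mean(history), pstdev(history)
    return bool(recent) and all(value - mu > k * sigma for value in recent)

def should_end_warning(since_warning, current, threshold):
    """True if the current value has dropped below the post-warning mean
    by more than the threshold."""
    return mean(since_warning) - current > threshold
```

The same pair of checks would be run for each feature parameter (page count, page views, composite index) on every periodic update.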
Based on the regional distribution of a particular event that has already occurred, the invention uses matrix decomposition and logistic regression analysis to analyse and predict whether the event will occur in the target region and when it is likely to occur, and issues different risk warnings according to the prediction.
Compared with the prior art, the advantages of the invention are as follows:
By applying spam filtering, region information discovery, object information discovery, trend tracking and early warning of regional events, and risk prediction and early warning to the crawled Internet information, the invention ensures the accuracy and comprehensiveness of food safety event discovery and early warning and of event occurrence risk prediction and early warning for target regions, and ensures the efficiency of food safety risk early warning.
Brief description of the drawings
Fig. 1 is a flow chart of the method for predicting and giving early warning of event occurrence risk based on open Internet information;
Fig. 2 is a schematic diagram of the deceptive spam opinion detection method;
Fig. 3 is a flow chart of the method for identifying the region a web page relates to;
Fig. 4 is a schematic diagram of the regional event early-warning method;
Fig. 5 is a schematic diagram of the event risk prediction and early-warning method for target regions.
Detailed description of the embodiments
A specific embodiment of the invention is shown in Fig. 1; the steps are described below.
1. Web page crawling
Internet crawler software (such as Heritrix or Nutch) is used to crawl the web pages of the information sources. During crawling, techniques such as limited-scope crawling and vertical crawling are used to obtain as many of the required web pages as possible, and the pages are saved.
2. Spam filtering
As the Internet develops, the number of web pages and the amount of content keep growing, and so does the amount of spam in those pages. To keep the subsequent processing accurate, spam must be filtered. Spam filtering can be divided into two parts: filtering spam web pages, and filtering spam opinions in user-generated content. Spam web pages can be divided into content-cheating pages and link-cheating pages. Spam opinions can be classified, by the size of their negative impact, into untrustworthy opinions, low-quality opinions and irrelevant opinions. Untrustworthy opinions are deceptive opinions: they either give a specific object, event or person excessive ratings or praise that do not match the facts, or give it excessively low ratings, abuse or attacks that do not match the facts. Low-quality opinions are generally short. Their content may or may not be useful, but because it does not describe the specific topic or product in detail, its value for opinion mining on that topic or product cannot be determined, so it is treated as a kind of spam opinion (from the computer's point of view). Irrelevant opinions are mainly advertisements or content unrelated to the topic.
For spam web pages in a website and for low-quality and irrelevant opinions in user-generated content, the spam characteristics are relatively obvious. Features along the content, content distribution and link dimensions can therefore be extracted from a sample set that has been built and labelled in advance, and detection models can be built from them. Before feature extraction, the web page information undergoes metadata extraction, body text extraction, word segmentation, sentence statistics, paragraph statistics, anchor text statistics, link statistics and similar processing.
For the content dimension, this method segments the extracted information into words, removes stop words, and applies dimensionality reduction (for example by the document frequency method or the information gain method) to form a content feature vector whose weights are term frequencies.
For the content distribution dimension, this method uses the title length (in characters), paragraph count, sentence count, paragraph length (mean), sentence length (mean), information length (in characters), anchor text count, anchor text length (in characters, mean) and so on. When the model is built, each feature is normalized as y = x/(max + 1), where x and y are the feature values before and after normalization and max is the maximum of that feature obtained beforehand from sample statistics over the website information set. If x > max occurs before the max parameter has been updated, x = max + 1 is used, so that y = 1.
For the link dimension, this method uses the ratio of the information's in-site outlinks to its total outlinks, the ratio of its off-site outlinks to its total outlinks, the ratio of its links to pages in the spam page set (built in advance) to its total outlinks, the ratio of the number of pages in the spam page set (built in advance) that link to this information to the total number of pages, and so on.
For these three dimensions, feature vectors are formed from the spam set and non-spam set built in advance, and a machine learning method (such as a support vector machine) is used to build the spam detection models (three models, each updated periodically from the updated sample set). Newly collected information can then be filtered: an item is judged to be spam when at least two of the models give a positive result.
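The normalization y = x/(max + 1) and the two-out-of-three decision rule can be written down directly. The sketch below is illustrative only; the function names are hypothetical, and the three boolean votes stand in for the outputs of the content, content distribution and link models.

```python
def normalize(x, max_seen):
    """Normalize a feature as y = x / (max + 1).

    If x exceeds the stored maximum (the maximum has not been refreshed yet),
    x is clamped to max + 1 so that y = 1.
    """
    x = min(x, max_seen + 1)
    return x / (max_seen + 1)

def is_spam(content_vote, distribution_vote, link_vote):
    """An item is spam when at least two of the three models say so."""
    return sum([content_vote, distribution_vote, link_vote]) >= 2
```

Dividing by max + 1 rather than max keeps the divisor non-zero even for a feature whose observed maximum is 0.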
Meanwhile, to solve the problem of identifying deceptive spam opinions, and drawing on intelligent-systems thinking, the identification steps are as shown in Fig. 2 and described in detail below.
(1) Generating the opinion set
The information crawled by the Internet crawler software from a particular user-generated-content source is preprocessed (including extraction of metadata such as the author of the web page information, body text extraction, word segmentation, part-of-speech tagging, named entity extraction, sentence statistics, paragraph statistics and punctuation statistics) to form the user opinion information set.
(2) Labelling deceptive spam opinions
The purpose of a deceptive spam opinion is to raise or lower, contrary to the facts, the image of a specific object such as a website, web page, product or person. It shows up either as excessive ratings or praise of a specific object, event or person that do not match the facts, or as excessively low ratings, abuse or attacks that do not match the facts. Starting from this, and considering certain distribution characteristics that deceptive spam opinions show in practice, heuristics are used to collect the user-generated content that may be deceptive spam opinions. Specifically, this step focuses on: opinions in user-generated content whose content is duplicated or nearly duplicated; opinions published by the top-N1 authors who published the most opinions within a certain time range; opinions related to the top-N2 specific objects with the most opinions within a certain time range; opinions related to the top-N3 IP addresses that published the most opinions within a certain time range; opinions published by the top-N4 earliest users to publish an opinion on a specific object; and opinions published by the top-N5 users who revised their opinions on a specific object the most times.
According to these rules, the opinions in the user opinion information set that meet the above conditions are collected to form the candidate deceptive spam opinion set. Then, following the principle of preferring omissions to errors (to ensure the accuracy of the deceptive spam opinion samples), the candidates are reviewed and confirmed, combining review, investigation and similar means. Two confirmation methods are used: positive confirmation and reverse confirmation.
In positive confirmation, if the content of an opinion describes the same matter as an entry in the deceptive spam opinion knowledge base, that is, the content matches the description of some entry in the knowledge base, the opinion is a deceptive spam opinion. The rule for adding entries to the knowledge base is: if an opinion published by a user is shown, after some time or by later proof, to be truly deceptive, it is added to the knowledge base. For example, someone posts on a forum that a certain brand of milk contains melamine; later someone else lists various reasons to argue that this is impossible, and it is afterwards proven that the latter post was a deception by an employee of the milk company. This opinion can thus be confirmed as deceptive spam and is added to the knowledge base (built in advance and updated regularly).
In reverse confirmation, the opinion is shown to be deceptive from the opposite direction: under normal circumstances, such information could not arise. For example, one rule in the reverse-confirmation knowledge base (built in advance and updated regularly) is: if a user id publishes more than N (such as 10) opinions on one or more products within a set time (such as 1 minute), the opinions published by that user are labelled as deceptive spam opinions. An example matching this rule: in a forum, a user id published 15 reviews of 3 different products in less than 1 minute, which is impossible for a normal person. The deceptiveness of the information published by this user is thus demonstrated from the reverse direction.
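The reverse-confirmation rule (more than N opinions from one user id within a set time) lends itself to a sliding-window check. The sketch below assumes posts are available as (user id, timestamp in seconds) pairs; that input format and the function name are hypothetical, while the defaults of 60 seconds and 10 posts follow the examples in the text.

```python
from collections import defaultdict

def flag_burst_posters(posts, window_seconds=60, max_posts=10):
    """Return the user ids that posted more than max_posts opinions
    within any window of window_seconds.

    posts: iterable of (user_id, timestamp_in_seconds) pairs.
    """
    by_user = defaultdict(list)
    for user, ts in posts:
        by_user[user].append(ts)

    flagged = set()
    for user, times in by_user.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Shrink the window until it spans at most window_seconds.
            while t - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_posts:
                flagged.add(user)
                break
    return flagged
```

All opinions from a flagged user id would then be labelled as deceptive spam opinions, as the rule states.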
The information confirmed by the above methods is labelled to form the accurate deceptive spam opinion set. Users who frequently publish deceptive spam opinions, that is, the N users who published the most deceptive spam opinions, are added to a blacklist for later identification. In addition, from the accurate deceptive spam opinion set and similar material, the abnormal behaviours of opinion authors (such as the user above publishing 15 items about 3 products within 1 minute) are summarized into rules for later use.
Note that clearly confirming that an opinion is not a deceptive spam opinion is also quite difficult (for a given item, being unable to show clearly that it is a deceptive spam opinion may equally mean being unable to state clearly that it is not). Considering time, workload and the diversity of non-deceptive opinions, non-deceptive opinions are not labelled here.
(3) Partitioning the opinion information
For each item in the user opinion information set formed and labelled in steps (1) and (2), the feature vector used for partitioning is extracted and the items are clustered to obtain several information regions; the concept vector of each region is then computed. The computation is described below.
The feature vector used for partitioning is extracted from each item in the opinion information set as follows. The features extracted are the character count of the opinion, word count, paragraph count, paragraph length (mean), sentence count, sentence length (mean), first-person pronoun count, second-person pronoun count, third-person pronoun count, adjective count, adverb count, verb count, person name count, place name count, organization name count, time expression count, exclamation mark count, question mark count, title character count and so on. Each feature is normalized as y = x/(max + 1), where x and y are the feature values before and after normalization and max is the maximum of that feature obtained beforehand from statistics over the user opinion information set; if x > max occurs before the max parameter has been updated, x = max + 1 is used, so that y = 1. This forms the normalized partitioning feature vector.
The opinion information can then be clustered; hierarchical clustering, non-hierarchical clustering or similar methods can be used.
Through this process, the original user opinion information set is divided into several sub-regions (partitions) on the basis of the partitioning feature vectors. The concept vector Mark_i of each partition (i is the partition number) is computed as the mean of the feature vectors of all items in that partition.
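Once a clustering step has assigned each opinion to a partition, the per-partition vector Mark_i can be computed as a component-wise mean. The sketch below assumes the cluster labels are already available (the clustering algorithm itself is left open, as in the text) and reads Mark_i as the mean of the partition's feature vectors; the function name is hypothetical.

```python
def partition_means(vectors, labels):
    """Return {partition label: mean feature vector} (Mark_i per partition).

    vectors: list of equal-length normalized feature vectors.
    labels:  partition label of each vector, from any clustering method.
    """
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}
```

These mean vectors are what a later step can measure distances against when weighting the per-partition detection models.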
(4) Sampling the opinion information
Samples are drawn from each user opinion partition formed in step (3) (the sample size is determined in advance). Random sampling is used, as follows:
Let the number of samples to draw be S (determined in advance) and the number of items in each partition be I_i. The number of samples to draw from each partition is then S_i = S*I_i/∑I_i. (This value is approximate: provided that the sample count of every partition stays above a preset threshold and that S = ∑S_i, the sample count of each partition may be adjusted moderately.)
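The proportional allocation S_i = S*I_i/∑I_i has to be turned into whole numbers that still sum to S and respect a minimum per partition. The text leaves the adjustment open, so the sketch below makes two assumptions: units lost to truncation are handed out by largest fractional part, and any partition below the minimum is topped up from the partition with the largest allocation. The function name is hypothetical.

```python
def allocate_samples(total, sizes, minimum=1):
    """Split `total` samples across partitions in proportion to their sizes."""
    if minimum * len(sizes) > total:
        raise ValueError("total is too small for the per-partition minimum")
    population = sum(sizes)
    quotas = [total * n / population for n in sizes]
    alloc = [int(q) for q in quotas]
    # Hand out the units lost to truncation, largest fractional part first.
    order = sorted(range(len(sizes)), key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in order[: total - sum(alloc)]:
        alloc[i] += 1
    # Move units from the largest allocation to any partition below the minimum.
    for i in range(len(alloc)):
        while alloc[i] < minimum:
            donor = max(range(len(alloc)), key=lambda j: alloc[j])
            alloc[donor] -= 1
            alloc[i] += 1
    return alloc
```

For example, 100 samples over partitions of 500, 300 and 200 items split as 50, 30 and 20; with very uneven partitions the minimum shifts a few samples toward the small ones while the total stays at S.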
The items in each partition are numbered from 1 upward until every item has a number; let the largest number be MAX_i. A random function then generates S_i random numbers between 1 and MAX_i, and the opinions corresponding to these S_i numbers are the samples drawn from that partition.
During sampling, 10 draws are made from each partition by the above rule, and the draw containing the most deceptive spam opinions is chosen as the final sample, so that as many deceptive spam opinions as possible are included in the sample.
This yields the sample set of each opinion partition.
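The repeated-draw rule can be sketched as follows. The text does not say whether the S_i random numbers may repeat; drawing without replacement is an assumption here, and `is_deceptive` is a hypothetical callback that reports whether an item was labelled as a deceptive spam opinion in step (2).

```python
import random

def best_of_draws(partition, sample_size, is_deceptive, draws=10, rng=None):
    """Draw `draws` random samples and keep the one with the most
    deceptive spam opinions.

    partition: list of items (or item numbers 1..MAX_i) in one partition.
    """
    rng = rng or random.Random()
    best, best_hits = None, -1
    for _ in range(draws):
        sample = rng.sample(partition, sample_size)  # without replacement
        hits = sum(1 for item in sample if is_deceptive(item))
        if hits > best_hits:
            best, best_hits = sample, hits
    return best
```

Passing a seeded `random.Random` makes the selection reproducible, which helps when the sample set has to be rebuilt for a model update.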
(5) Second labelling of the opinion samples
The samples drawn from each partition are sorted and labelled a second time as either deceptive spam opinions or other opinions, so that the samples of each partition form a set of deceptive spam opinions and unlabelled opinions.
(6) Sample feature extraction
Sample feature extraction, algorithm selection and so on, applied to the twice-labelled samples in each partition, are necessary steps in building the detection model. Sample feature extraction is a crucial step; in this method it proceeds as follows:
A) The content of each drawn sample opinion is segmented into words, stop words are removed, and after dimensionality reduction (for example by the document frequency method or the information gain method) a content feature vector Q_j is formed (the weights are term frequencies; j is the sample number).
B) The link features of the sample opinion are then computed, including the ratio of the information's in-site outlinks to its total outlinks, the ratio of its off-site outlinks to its total outlinks, the ratio of its links to items in the accurate deceptive spam opinion set to its total outlinks, the ratio of the number of items in the accurate deceptive spam opinion set that link to this information to the total number of pages, and so on. The parameters are weighted (the weights are determined beforehand by statistical analysis and must sum to 1) to give a total value L_j.
C) Finally M_j = L_j*Q_j is computed, giving the initial feature vector M_j that characterizes the sample opinion on the basis of content and links.
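Steps A) to C) can be illustrated with a small sketch. It assumes the text has already been segmented into tokens and that the reduced vocabulary is given; the function names, the example vocabulary and the equal link weights are hypothetical.

```python
from collections import Counter

def content_vector(tokens, vocabulary, stop_words=frozenset()):
    """Q_j: term frequencies over a fixed (already reduced) vocabulary."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return [counts[term] for term in vocabulary]

def link_score(ratios, weights):
    """L_j: weighted sum of the link ratios; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(r * w for r, w in zip(ratios, weights))

def initial_vector(tokens, vocabulary, ratios, weights, stop_words=frozenset()):
    """M_j = L_j * Q_j: the content vector scaled by the link score."""
    l_j = link_score(ratios, weights)
    return [l_j * q for q in content_vector(tokens, vocabulary, stop_words)]
```

Because L_j is a scalar, M_j points in the same direction as Q_j; the link score changes the vector's length, not its orientation.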
D) to some sample Sample in subregion, the initial characteristicses vector based on the sample calculates itself and each subregion In each sample information similar value (adopting cosine function), and each sample information is sorted from big to small according to similar value, is obtained Its similar sample sequence.
E) by the classification logotype of the secondary mark of P before in sequence (in advance by analyze determine) individual sample information, (1 represents and is Duplicity rubbish suggestion sample, -1 indicates without mark sample) and similar value (and the sample between) be multiplied respectively, and form one Vectorial N of the number of latitude for P, as the final characteristic vector of sample Sample.
Circulation step D according to this)-E), the characteristic vector until being calculated all samples.
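Steps C) to E) can be illustrated with a minimal Python sketch. It assumes sparse term-frequency dictionaries for Q_j and a precomputed link score L_j; the function names and the choice to pass in the neighbouring samples (without the sample itself) are illustrative assumptions, not part of the described method.

```python
import math

def initial_vector(term_freqs, link_score):
    """Step C: M_j = L_j * Q_j, scaling each term frequency by the link score."""
    return {term: link_score * tf for term, tf in term_freqs.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def final_vector(target, neighbours, p):
    """Steps D-E: rank neighbours by similarity, keep the top P,
    and multiply each similarity by the neighbour's label (+1 or -1).

    neighbours: list of (initial_vector, label) pairs (assumed input shape).
    """
    scored = sorted(
        ((cosine(target, vec), label) for vec, label in neighbours),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [label * sim for sim, label in scored[:p]]
```

The length of the returned list is P only when at least P neighbours are available; how shorter lists are padded is not specified in the text.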
(7) Building the deceptive spam opinion detection model
Once a feature vector has been built for each sample opinion, a machine learning method must be chosen to build the detection model that identifies deceptive spam opinions. Note that the sample set obtained above (step (5)) contains labeled deceptive spam opinions and unlabeled opinions, but no labeled non-deceptive opinions. An ordinary supervised machine learning method therefore cannot simply be applied, because building its model requires both a positive and a negative example set. For this reason a machine learning method for "learning from positive and unlabeled data" is used here: biased SVM (Liu, B., Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. Proceedings of IEEE International Conference on Data Mining, 2003).
For each partition, the feature vectors computed above and the selected "learning from positive and unlabeled data" method are used to build a detection model that identifies deceptive spam opinions (one model per partition).
(8) Deceptive spam opinion detection and identification
Once the detection model of each partition has been built, user-generated content newly crawled by the internet crawler can be checked for deceptive spam opinions. Overall, identification proceeds in three steps: blacklist identification, reverse identification, and model-based identification. Blacklist identification comes first: content posted by a blacklisted user is directly identified as a deceptive spam opinion. The remaining opinions are checked by reverse confirmation using the rules summarized in step (2) (that is, information that could not appear under normal circumstances is thereby shown, from the reverse direction, to be a deceptive spam opinion), and abnormal opinions are identified as deceptive spam opinions. The opinions still remaining are identified with the models built in step (7), as follows:
First compute the partition and the feature vector of the opinion (as described in step (3)), and compute the distance d_i between the opinion and each partition (the distance between the opinion's feature vector and each partition's characteristic vector; i is the partition number). From these, compute the weight of each partition's detection model for this opinion: e_i = d_i / Σ d_i.
Apply the detection model of each partition to the opinion to obtain the detection result O_i (first build the opinion's initial feature vector, then find the samples similar to the opinion and obtain the final feature vector, as in step (6); then apply the model built in step (7)). The final detection result is O = Σ e_i * O_i. If O exceeds a predetermined threshold, the opinion is identified as a deceptive spam opinion.
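The weighting and combination just described can be sketched as follows. This is a minimal illustration that follows the formulas e_i = d_i / Σ d_i and O = Σ e_i * O_i as written; the function names and the error raised for an all-zero distance list are assumptions.

```python
def partition_weights(distances):
    """e_i = d_i / sum(d_i) for the distances to each partition."""
    total = sum(distances)
    if total == 0:
        # The text does not say what to do in this case (assumption: reject).
        raise ValueError("all distances are zero; weights are undefined")
    return [d / total for d in distances]

def combined_score(distances, outputs):
    """O = sum(e_i * O_i) over the per-partition detection results O_i."""
    weights = partition_weights(distances)
    return sum(e * o for e, o in zip(weights, outputs))

def is_deceptive(distances, outputs, threshold):
    """Flag the opinion when the combined result exceeds the preset threshold."""
    return combined_score(distances, outputs) > threshold
```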
User opinions identified as deceptive spam opinions by the above steps are uniformly labeled as deceptive spam opinions according to the standard.
(9) Updating the deceptive spam opinion detection model
Because deceptive and non-deceptive opinions change dynamically, the detection models and the opinion sets are updated periodically while detection is running (including updating the relevant parameters). The process is as follows:
The deceptive spam opinions identified and labeled in step (8) are audited and confirmed in the same way as in step (2). The confirmed items are labeled and form a new confirmed deceptive spam opinion set (users who frequently post deceptive spam opinions are added to the blacklist for later identification; the behavior patterns of the authors in the new confirmed set are also summarized into rules for later use). A new set of unlabeled user opinions is formed at the same time.
For the new user opinion set, take as initial centers the vectors closest to the characteristic vector of each original partition (to distinguish them, Mark_i is written Mark_old,i here) and run the clustering process with a non-hierarchical clustering method to obtain new partitions of user opinions. Compute the characteristic vector Mark_new,i of each new partition and the sum of the distances between the new and old characteristic vectors, Dis = Σ Distance(Mark_new,i, Mark_old,i). If Dis exceeds a preset threshold, run the model update process (steps (3)-(7)) to update the models.
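The update trigger can be sketched in a few lines of Python. The text does not name the distance function, so Euclidean distance is used here purely as an assumption, and the new and old characteristic vectors are assumed to be paired by partition index.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def drift(new_vectors, old_vectors, distance=euclidean):
    """Dis = sum of Distance(Mark_new_i, Mark_old_i) over all partitions."""
    return sum(distance(n, o) for n, o in zip(new_vectors, old_vectors))

def needs_model_update(new_vectors, old_vectors, threshold, distance=euclidean):
    """Re-run steps (3)-(7) only when the partitions have moved far enough."""
    return drift(new_vectors, old_vectors, distance) > threshold
```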
After the above filtering steps, the information (non-spam) that enters subsequent processing is of relatively high quality, which provides a basis for accurate subsequent processing.
3. Region information discovery
To issue risk warnings of food safety events using information on the internet, event-related information must be obtained through processing. Determining the region that an item of internet information relates to is an important part of this work: it makes it possible to determine where an event occurs, which is the basis of food safety event early warning. This requires extracting and analyzing the content of web page information to determine the region associated with food safety event information. The steps are as follows (see Fig. 3):
(1) Web page information preprocessing
For web page information that has been crawled and filtered, extract and save metadata such as title, source, author, publication time, and the location of the publishing website, and extract and save the body content.
For the extracted title and body content, use a segmenter to perform statistics- and dictionary-based word segmentation (the dictionary includes a place-name dictionary formed from the ontology built in step 4(1)), recording for each word feature parameters such as its start and end positions relative to the text formed by the title and body, the sentence it belongs to, and its positions relative to the start and end of that sentence. Then match the words against a vocabulary to exclude words that may not be place names. The vocabulary is compiled in advance and updated regularly; it contains words that are both personal names and place names, and words that have other specific meanings but may also be place names. For example, Wuzhong is a city in Ningxia Hui Autonomous Region but can also be a personal name; Founder (Fangzheng) is a county in Heilongjiang Province but can also be the company Founder. Note that words carrying a specific suffix, such as Wuzhong City, are not excluded.
(2) Place pronoun resolution
After segmentation, the title and body text may contain pronouns that denote places, such as "this province", "this city", or "the province". Because such pronouns do not literally give an exact geographic location, they must be resolved.
1) To resolve place pronouns, first set up a sliding window for pronoun resolution. The window length L is determined in advance (for example, after analyzing the distribution of the number of words between place pronouns and their antecedents).
2) Check whether a plausible place name exists within the L words before the place pronoun (for example, Liaoning for "this province"; plausibility is judged by rules established in advance). If so, use the judgment model described below to decide whether a reference relation exists between the place name and the place pronoun. If it does, determine the place name the pronoun refers to and end the resolution (if several place names satisfy the reference relation, choose the one nearest to the pronoun). Otherwise go to step 3).
3) If no plausible place name exists within L words, or the model judges that no reference relation exists, check whether a plausible place name exists within the 2L words before the place pronoun (without going beyond the sentence, delimited for example by a full stop). If so, use the judgment model described below to decide whether a reference relation exists between the place name and the place pronoun. If it does, determine the place name the pronoun refers to and end the resolution (if several place names satisfy the reference relation, choose the one nearest to the pronoun). Otherwise go to step 4).
4) If no plausible place name exists within 2L words, or the model judges that no reference relation exists, determine the place name the pronoun refers to by extraction or replacement, using the information source or website location obtained during metadata extraction.
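Steps 1) to 4) amount to a widening backward search with a metadata fallback. The sketch below is a hedged illustration: `is_place` stands in for the prior plausibility rules, `refers_to` for the judgment model, and `fallback` for the place derived from metadata; all three, and the function name, are hypothetical. The token list is assumed to hold only the sentence containing the pronoun, so the search cannot leave the sentence.

```python
def resolve_place_pronoun(tokens, pronoun_idx, window, is_place, refers_to, fallback):
    """Return the place name a place pronoun refers to.

    tokens      -- words of the sentence containing the pronoun
    pronoun_idx -- index of the place pronoun in tokens
    window      -- sliding-window length L
    is_place    -- callable(token) -> bool, the plausibility rule (assumed)
    refers_to   -- callable(tokens, noun_idx, pronoun_idx) -> bool, the model (assumed)
    fallback    -- place name taken from the page metadata
    """
    for span in (window, 2 * window):          # step 2, then step 3
        start = max(0, pronoun_idx - span)
        # Walk backwards so the candidate nearest to the pronoun wins.
        for i in range(pronoun_idx - 1, start - 1, -1):
            if is_place(tokens[i]) and refers_to(tokens, i, pronoun_idx):
                return tokens[i]
    return fallback                             # step 4
```

Searching the 2L window re-tests the candidates already rejected in the L window; a production version might cache the model's decisions.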
How the judgment model is built: Collect web page information containing place pronouns to form a sample set. For each place pronoun in the sample set and each place name within the preceding L or 2L words (L as in step 1); without going beyond the sentence), label whether a reference relation exists between them; this label is the class variable. For each such pair, extract the relevant data and build a feature vector describing the relation between the place pronoun and the place name. The features include: the place-name suffix length (the suffix marks a place name or has place-name character, such as "Autonomous Region" in "Xinjiang Uygur Autonomous Region"; number of characters in the suffix divided by text length); the distance between the place name and the place pronoun (number of words divided by text length); the relative distance of the place name from the start of the text; the relative distance of the place pronoun from the start of the text; the relative distance of the place name from the start of the sentence; the relative distance of the place pronoun from the start of the sentence; the relative distance of the place name from the end of the sentence; the relative distance of the place pronoun from the end of the sentence; and so on (each relative distance is a number of words divided by text length). Then select a machine learning method (such as SVM) and, from the sample set, class variable, and feature vectors above, build the judgment model that decides whether a reference relation exists between a place name and a place pronoun.
The judgment model is applied to a place pronoun and a place name as follows. First extract the data describing the relation between the place name and the place pronoun to form a feature vector. The extracted data are: the place-name suffix length (number of characters in the suffix divided by text length); the distance between the place name and the place pronoun (number of words divided by text length); the relative distance of the place name from the start of the text; the relative distance of the place pronoun from the start of the text; the relative distance of the place name from the start of the sentence; the relative distance of the place pronoun from the start of the sentence; the relative distance of the place name from the end of the sentence; the relative distance of the place pronoun from the end of the sentence; and so on (each relative distance is a number of words divided by text length). Then apply the judgment model built above and, from its result, determine whether a reference relation exists between the place pronoun and the place name.
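The eight named features can be assembled as in the following sketch. Positions are assumed to be word indices into the combined title-and-body text, and the suffix length is assumed to be a character count, as the text suggests; the function and parameter names are illustrative.

```python
def pair_features(noun_pos, pronoun_pos, sent_start, sent_end, text_len, suffix_len):
    """Feature vector for one (place name, place pronoun) pair.

    noun_pos, pronoun_pos -- word indices in the text
    sent_start, sent_end  -- word indices of the sentence boundaries
    text_len              -- text length used for normalisation
    suffix_len            -- length of the place-name suffix
    """
    n = float(text_len)
    return [
        suffix_len / n,                      # place-name suffix length
        abs(pronoun_pos - noun_pos) / n,     # noun-to-pronoun distance
        noun_pos / n,                        # noun from start of text
        pronoun_pos / n,                     # pronoun from start of text
        (noun_pos - sent_start) / n,         # noun from start of sentence
        (pronoun_pos - sent_start) / n,      # pronoun from start of sentence
        (sent_end - noun_pos) / n,           # noun from end of sentence
        (sent_end - pronoun_pos) / n,        # pronoun from end of sentence
    ]
```

A vector of this shape would be what the SVM (or other chosen classifier) receives both when the model is built and when it is applied.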
(3) Non-standard word resolution
After segmentation, the title and body text may contain place words written in non-standard forms, such as "beijing" or "bj" appearing in Chinese text. These are resolved by looking them up in a table that maps non-standard words to standard words (built in advance and updated regularly) and replacing them with the standard form.
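A lookup-and-replace of this kind might look like the following sketch. The two table entries mirror the example in the text; the table name, the function name, and the case-insensitive lookup are assumptions.

```python
# Illustrative mapping from non-standard forms to the standard place word.
NONSTANDARD_TO_STANDARD = {
    "beijing": "北京",
    "bj": "北京",
}

def normalize_place_words(tokens, table=NONSTANDARD_TO_STANDARD):
    """Replace each token found in the table; leave all other tokens unchanged."""
    return [table.get(tok.lower(), tok) for tok in tokens]
```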
(4) Relative position resolution
After segmentation, the title and body text may contain place expressions that use relative position, such as "the southwestern provinces of China". Such expressions likewise contain no explicit place name. To solve this, these relative-position expressions are looked up and resolved using the region ontology instances and their additional tables built in step 4(1), yielding exact place-name words. For "the southwestern provinces of China", for example, the region ontology is consulted to find the provinces belonging to China, the place-orientation additional table of each province is queried, all provinces whose orientation is southwest are extracted, and these replace the original expression, completing the resolution.
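The lookup can be illustrated with a toy ontology fragment. The dictionary layout below (parent region mapped to child regions and their orientation) is an assumed simplification of the region ontology and its place-orientation table, and it lists only three provinces for brevity.

```python
# Toy fragment: parent region -> {child region: orientation within the parent}.
REGION_ORIENTATION = {
    "China": {
        "Sichuan": "southwest",
        "Yunnan": "southwest",
        "Liaoning": "northeast",
    },
}

def resolve_relative_region(parent, orientation, table=REGION_ORIENTATION):
    """Return the child regions of `parent` lying in the given orientation."""
    children = table.get(parent, {})
    return sorted(name for name, where in children.items() if where == orientation)
```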
(5) Region determination
After the web page information has been preprocessed and resolved as above, the region associated with the information is determined. This involves two steps, which use pattern matching and a machine learning judgment model respectively.
Region determination aims to identify the region an item of information relates to, providing the regional basis for discovering food safety event information. Considering accuracy, computational cost, and operability, pattern matching is applied first. Two questions must be settled: the scope of information and the matching rules. The matching rules are based on the region ontology built in step 4(1) (the region dimension of the ontology) and mainly use the names and attributes of part of the ontology instances; these names and attributes are combined and matched by pattern. The pattern matching methods include Boolean matching, frequency matching, and matching on the distance between instance names; the choice of method and the specific rules are determined by statistical analysis of the information (determined in advance and updated regularly). The scope of information covers two dimensions, the title and the content. Because the title and the content may not agree, the title is processed first: if pattern matching on the title assigns the information to the region under consideration (Beijing, for example), pattern matching for that region is complete; otherwise a second pattern match for that region is run on the content. Throughout, the principle is that it is better to miss a match than to accept a wrong one, so that the accuracy of the results is kept as high as possible.
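The title-first, content-second order can be sketched as below. The concrete rules are left to prior statistical analysis in the text, so the two rules used here (Boolean matching on the title, a minimum hit count on the body) and the default threshold are illustrative assumptions only.

```python
def count_hits(tokens, region_terms):
    """Number of tokens that match a name or attribute of the region."""
    return sum(1 for tok in tokens if tok in region_terms)

def matches_region(title_tokens, body_tokens, region_terms, min_body_hits=2):
    """Try the title first; fall back to the body only when the title fails."""
    # Boolean matching on the title: one hit is enough.
    if count_hits(title_tokens, region_terms) > 0:
        return True
    # Frequency matching on the body: require several hits to stay conservative.
    return count_hits(body_tokens, region_terms) >= min_body_hits
```

Requiring more evidence from the body than from the title is one way to follow the stated preference for missing a match over accepting a wrong one.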
If the information cannot be assigned to a region by the pattern matching above, a third judgment is made with a region judgment model built by machine learning. The model is built in advance as follows. Starting from a web page sample set that has been processed (as in steps (1)-(4)) and labeled (whether or not each sample is associated with a given region; built in advance and updated regularly), the words in the title and content of each sample that match ontology instance names or attributes are combined and sorted into five categories: administrative place names (provinces, cities, and so on), area codes, postal codes, abbreviations, and landmarks (mountains, lakes, seas, rivers, islands, buildings, and so on). This gives five feature vectors (the term weights are term frequencies; because title words are more important, their weights are multiplied by a predetermined factor). A machine learning method (support vector machines, for example) is then used to build, for each target region, region judgment models from these five feature vectors (five models, updated regularly from the updated sample set).
The third judgment proceeds as follows. For information that has been processed and resolved in steps (1)-(4) but could not be assigned to a region, the words in its title and content that match ontology instance names or attributes are combined and sorted into the same five categories to form five vectors (the term weights are term frequencies; title word weights are multiplied by the predetermined factor). Each of the five vectors is judged by the corresponding one of the five region judgment models, and the five results are combined by weighting (the weight of a category is the total word frequency of that category in the web page divided by the total word frequency of all five categories). If the weighted result exceeds a preset threshold, the information is assigned to the region; otherwise it is not.
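The final weighting step can be sketched as follows. The category keys and function names are hypothetical, and the per-category model outputs are passed in as plain numbers because the models themselves are outside the scope of this sketch.

```python
# Hypothetical keys for the five word categories named in the text.
CATEGORIES = ("admin", "area_code", "postcode", "abbrev", "landmark")

def region_score(category_counts, category_outputs):
    """Combine the five model outputs, weighting each category by its share
    of the total word frequency over all five categories."""
    total = sum(category_counts.get(c, 0) for c in CATEGORIES)
    if total == 0:
        return 0.0  # no matching words at all (assumed to mean "no evidence")
    return sum(
        category_counts.get(c, 0) / total * category_outputs.get(c, 0.0)
        for c in CATEGORIES
    )

def belongs_to_region(category_counts, category_outputs, threshold):
    """Assign the information to the region when the weighted score is high enough."""
    return region_score(category_counts, category_outputs) > threshold
```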
4. Regional event early warning
Drawing on intelligent-system methods, the steps for discovering and giving early warning of regional food safety event information are designed as shown in Fig. 4 and described below.
(1) Building the ontology
Given the characteristics of food safety events and the needs of analyses such as event information extraction and tracking, the food safety event ontology is built along five dimensions: object, region, time, result, and associated party. The object, that is, food, can be divided into categories such as primary products and processed products; primary products can in turn be divided into categories such as vegetables and fruits, and so on. The result can be divided into categories such as contamination and poisoning; contamination can in turn be divided into categories such as expired and over-limit, and so on. Regions can be divided overall into five categories: Asia, Europe, Africa, the Americas, and Oceania. Each category can be subdivided further; Asia, for example, can be divided into six categories: East Asia, West Asia, South Asia, North Asia, Central Asia, and Southeast Asia. Division continues until a category cannot be divided further, which gives a bottom-level element (an instance). The other categories are built in the same way. In addition, for each instance in the ontology, additional tables of synonyms, antonyms, aliases, and the like are built. For instances in the region ontology, additional tables are built for six dimensions: area code, postal code, abbreviation, landmarks (mountains, lakes, seas, rivers, islands, buildings), adjacent regions (neighboring regions of the same level to the east, south, west, north, and so on), and place orientation (relative to the level above, such as central or south), for use during information processing.
(2) Information filtering
Because a website may contain content unrelated to the intended topic, information is filtered for food safety relevance before further processing, in order to improve the accuracy of event information discovery and early warning.
Food safety information filtering decides whether a collected item is related to food safety. Two questions must be settled: the scope of information and the filtering rules. The filtering rules are based on the food safety event ontology and mainly use two dimensions, object and result; the names and attributes of the ontology instances in these two dimensions are combined and matched by pattern. The pattern matching methods include Boolean matching, frequency matching, matching on the distance between instance names, synonym and antonym matching of instance names, and alias matching of instance names; the choice of method and the specific rules are determined by statistical analysis of the information (determined in advance and updated regularly). The scope of information covers two dimensions, the title and the content. Because the title and the content may not agree, the title is processed first: if filtering on the title assigns the information to the food safety category, processing of this item is complete; otherwise a second judgment is made on the content.
After the above filtering steps, the information that enters subsequent processing (that is, non-spam information related to food safety) is of relatively high quality, which provides a basis for accurate subsequent processing.
(3) Object information discovery
Object information discovery for web page information means identifying the object category, that is, determining which kind of object the content of the web page concerns (and which event factors are involved and which consequences resulted). Its purpose is to identify an event as uniquely as possible by combining the region information and object information found in the web page information.
Considering accuracy of identification, computational cost, and operability, regression analysis is used for this purpose. The information used is the title and content of each web page combined; after word segmentation, stop word removal, and dimensionality reduction, they form the feature vector of the web page (the independent variables). The term weights are term frequencies; because title words are more important, their weights are multiplied by a predetermined factor. Likewise, the weights of words that match the names or attributes of object, result, or associated-party instances in the ontology are multiplied by a predetermined factor. For each object category, the feature vector of the web page is substituted into the corresponding logistic regression model (built in advance, from the sample set, for each category that needs to be distinguished), and the regression result decides whether the web page is related to that object category.
The regression analysis models are built as follows. Starting from a web page sample set that has been processed and labeled (built in advance and updated regularly), the title and content words of each sample are combined and, after word segmentation, stop word removal, and dimensionality reduction, form a feature vector (the independent variables). The term weights are term frequencies; because title words are more important, their weights are multiplied by a predetermined factor. Likewise, the weights of words that match the names or attributes of object, result, or associated-party instances in the ontology are multiplied by a predetermined factor. The object category of each web page is labeled (1 means it belongs to the object category and 0 means it does not; this is the dependent variable). On this basis a regression analysis model is built for each object category using the logistic method.
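The feature weighting and the use of a fitted logistic model can be sketched as follows. The boost factors are placeholders for the "predetermined factors", the coefficients are assumed to come from a model fitted elsewhere, and all names are hypothetical; model fitting itself is not shown.

```python
import math
from collections import Counter

def weighted_tf(title_tokens, body_tokens, ontology_terms,
                title_boost=2.0, ontology_boost=1.5):
    """Term-frequency vector in which title occurrences and ontology-matching
    words receive extra weight (both factors are illustrative values)."""
    vec = Counter(body_tokens)
    for tok in title_tokens:
        vec[tok] += title_boost          # a title occurrence counts title_boost times
    return {
        term: weight * (ontology_boost if term in ontology_terms else 1.0)
        for term, weight in vec.items()
    }

def logistic_probability(features, coefficients, intercept):
    """Probability that the page belongs to the object category, given the
    coefficients of an already fitted logistic regression model."""
    z = intercept + sum(coefficients.get(t, 0.0) * w for t, w in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

A page would then be assigned to an object category when the probability exceeds a chosen cut-off; the text does not state the cut-off value.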
(4) Trend tracking and event early warning
In practice, combining the region information and object category information found in the preceding steps is enough to determine correctly which event is occurring (the information related to the event is the intersection of the information belonging to these two dimensions).
Once the region and object category of the web page information have been identified, feature parameters that represent the event are set up. Specifically, the following are used as event features: the number of pages of information related to the event, the number of page views, the number of page forwards, the number of page views on specific websites, the number of page views on websites under specific domain names, and a composite index (obtained by weighting the preceding parameters; the weights are determined by the Delphi method and must sum to 1). The feature parameters are computed periodically (every hour, for example), and their changes over time are analyzed together.
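The composite index is a weighted sum whose weights add up to 1, which can be sketched as below. The parameter names are hypothetical, and the sketch applies the weights to the raw values because the text does not say whether the parameters are normalized first.

```python
def composite_index(params, weights, tolerance=1e-9):
    """Weighted sum of the event feature parameters.

    params  -- e.g. {"pages": 10, "views": 200} (hypothetical parameter names)
    weights -- expert-assigned weights that must sum to 1
    """
    if abs(sum(weights.values()) - 1.0) > tolerance:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * params[name] for name in weights)
```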
On the basis of this trend tracking, the value of each feature parameter of the event (including the composite index) is computed periodically (every 12 hours, for example), and the current value of each feature parameter is compared with its mean over a preceding period (given how events spread online, one month is currently chosen as the period; it can be adjusted as needed). If the difference is positive and its absolute value exceeds a threshold (3 times the standard deviation, for example; the threshold is set in advance), early-warning initialization is performed for the event.
An event for which early warning has been initialized is then tracked. The value of each feature parameter (including the composite index) is computed periodically (every 12 hours, for example), and the current value of each feature parameter is compared with its mean over a preceding period (given how events spread online, the month before early-warning initialization is currently chosen as the period; it can be adjusted as needed). If the difference stays above a threshold (3 times the standard deviation, for example; set in advance) for a sustained time (24 hours, for example; determined in advance), a formal early warning is issued for the event. Otherwise the early-warning initialization for the event is cancelled.
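The two-stage logic (initialization, then confirmation or cancellation) can be sketched as a small state machine over periodic observations. This is a simplified illustration: it uses one fixed baseline window for both stages, the population standard deviation, and a threshold of k standard deviations, all of which are assumptions; with 12-hour observations, two confirming observations correspond to the 24-hour example in the text.

```python
from statistics import mean, pstdev

def track_warning(history, observations, k=3.0, confirm_steps=2):
    """Return 'none', 'initialized' or 'formal' after the given observations.

    history       -- feature values from the baseline period (e.g. one month)
    observations  -- periodic values (e.g. every 12 hours), oldest first
    k             -- threshold in standard deviations (3 in the text's example)
    confirm_steps -- consecutive exceedances needed after initialization
    """
    baseline = mean(history)
    limit = k * pstdev(history)
    state, streak = "none", 0
    for value in observations:
        above = value - baseline > limit       # positive difference beyond threshold
        if state == "none":
            if above:
                state, streak = "initialized", 0
        elif state == "initialized":
            if above:
                streak += 1
                if streak >= confirm_steps:
                    state = "formal"
            else:
                state = "none"                  # cancel the initialization
    return state
```

The same comparison would be run for each feature parameter, including the composite index; how the per-parameter outcomes are combined is not specified in the text.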
The threshold is determined as follows. Historical change data for each feature parameter of events are collected (over one year, for example) and combined with confirmed data on the time, region, and scale of historical food safety events (obtainable from the food safety authorities). The difference between each feature parameter value of an event and its mean over the preceding period (one month, for example) is computed to form the independent variables; a variable indicating whether a food safety event of a particular nature occurred (1 means it occurred, 0 means it did not) is the dependent variable; and logistic regression analysis is used to build a regression prediction model between them. Based on this model and the historical trend of the event feature parameters, a suitable independent-variable value that makes the dependent variable 1 is chosen as the threshold.
(6) Event termination judgment
For an event under formal early warning, on the basis of the trend tracking above, the value of each feature parameter (including the composite index) is computed periodically (every 12 hours, for example), and the current value of each feature parameter is compared with its mean over a preceding period (given how events spread online, the period currently chosen runs from the day the early warning started to the day before the calculation; it can be adjusted as needed). If the difference is negative and its absolute value exceeds a threshold (3 times the standard deviation, for example; set in advance), the event is considered to have ended and its early warning is terminated.
(7) Ontology supplementation and correction
Throughout event information discovery and early warning, the food safety event information ontology that has been built has an important influence on the performance of steps such as information filtering and information discovery. Therefore, considering that the distribution of internet information changes over time, and in order to keep improving the efficiency of the method, the results of information filtering, information discovery, and similar processes need to be evaluated periodically, and deficiencies in the ontology, such as omissions and errors, need to be supplemented and corrected so as to improve the subsequent efficiency of the method.
5. Prediction and early warning of event risk in the target region
When a particular event has occurred in some regions, the possibility that the event will occur in the target region (where it has not yet occurred) and its possible time of occurrence are calculated periodically, and early warnings of different levels are issued according to the result of the analysis (as shown in Fig. 5). Before this calculation can be made, the models (which are updated periodically) are built as follows:
Regions at the same administrative level as the target region (for example, Beijing) are selected, such as the province-level regions Hebei and Henan. Data on the confirmed historical food safety events of these regions (including the target region; R regions in total), such as time of occurrence, region, and scale (obtainable from the food safety regulatory authorities), are collected to form a data set recording where and when each food safety event occurred. On this basis, a network graph is built according to whether each particular event occurred in each region. The vertices of the graph are the regions and the food safety events; if a particular event occurred in a region, an edge is created between the vertex of that region and the vertex of that event, and the weight of the edge is the number of times this occurred. The network graph is then converted into an R*S matrix A (R is the number of regions, S is the number of food safety events), which is formed in advance and updated periodically.
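Building matrix A from the historical records can be sketched as below. The record format (one `(region, event)` pair per confirmed occurrence) and the event names are assumptions chosen for illustration.

```python
from collections import Counter

def build_region_event_matrix(records):
    """records: list of (region, event) pairs, one per confirmed occurrence.

    Returns the sorted region list, the sorted event list, and the R*S matrix A
    whose entry A[i][j] is the edge weight between region i and event j,
    i.e. the number of times event j occurred in region i.
    """
    weights = Counter(records)
    regions = sorted({r for r, _ in weights})
    events = sorted({e for _, e in weights})
    A = [[weights[(r, e)] for e in events] for r in regions]
    return regions, events, A

records = [
    ("Beijing", "melamine"), ("Hebei", "melamine"), ("Hebei", "melamine"),
    ("Henan", "clenbuterol"), ("Hebei", "clenbuterol"),
]
regions, events, A = build_region_event_matrix(records)
print(regions)  # ['Beijing', 'Hebei', 'Henan']
print(events)   # ['clenbuterol', 'melamine']
print(A)        # [[0, 1], [1, 2], [1, 0]]
```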
Meanwhile, N time ranges are set according to the difference between the time at which a particular event occurred in the target region and the time at which that event first occurred in any region. For example, 5 time ranges can be set: the event occurs in the target region within 1 day, within 3 days, within 1 week, within 2 weeks, or within 1 month of its earliest occurrence. The original data set is labeled separately for each time range (marking whether the particular event occurred in each region within that range), forming N data sets (5 data sets when 5 time ranges are set), which are formed in advance and updated periodically. On this basis, for each data set, whether the particular event occurred in the target region within the time range is taken as the dependent variable (1 for occurred, 0 for not occurred), and whether the corresponding event occurred in each of the remaining regions is taken as the independent variables (1 for occurred, 0 for not occurred). Logistic regression analysis is used to build the regression prediction models between the independent and dependent variables (5 models, denoted C1, C2, C3, C4, C5, formed in advance and updated periodically).
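One possible way to produce the N labeled data sets is sketched below. It assumes each historical outbreak is stored as a mapping from region to the day on which the event first appeared there, and that "within w days" is inclusive; both are assumptions, as are the function and variable names.

```python
WINDOWS_DAYS = [1, 3, 7, 14, 30]  # example time ranges, one per model C1..C5

def label_datasets(outbreaks, regions, target, windows=WINDOWS_DAYS):
    """Build one labeled data set per time range.

    outbreaks: list of non-empty dicts {region: day of first occurrence};
               a region missing from a dict never experienced that outbreak.
    Returns a list (one entry per window) of rows (x, y), where x holds the
    0/1 indicators for the non-target regions and y the 0/1 target label.
    """
    others = [r for r in regions if r != target]
    datasets = []
    for w in windows:
        rows = []
        for occ in outbreaks:
            first = min(occ.values())  # earliest occurrence in any region

            def hit(region):
                return int(region in occ and occ[region] - first <= w)

            rows.append(([hit(r) for r in others], hit(target)))
        datasets.append(rows)
    return datasets

regions = ["Beijing", "Hebei", "Henan"]
outbreaks = [{"Hebei": 0, "Henan": 2, "Beijing": 5}, {"Henan": 0, "Hebei": 10}]
for window, rows in zip(WINDOWS_DAYS, label_datasets(outbreaks, regions, "Beijing")):
    print(window, rows)
```

Each of the resulting data sets would then be passed to a logistic regression fit to obtain the corresponding model.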
On this basis, the possibility that the particular event will occur in the target region and its possible time of occurrence are calculated as follows:
According to the regions where the particular event has currently occurred, the corresponding elements of matrix A are updated, and matrix A is then processed by matrix decomposition to form a new matrix B. For example, with the SVD method, matrix A is first decomposed by singular value decomposition, A = TySyDy, where Ty is an R*F matrix, Sy is an F*F diagonal matrix, Dy is an F*S matrix, and F is the rank of matrix A. A positive integer K with 0 < K < F is set, and only the K largest singular values in Sy are kept: the corresponding K-order diagonal matrix of Sy is denoted Sm, the corresponding K columns of Ty are denoted Tm, and the corresponding K rows of Dy are denoted Dm. The inverse operation of the singular value decomposition is then applied, B = TmSmDm, which completes the processing. Next, the element of matrix B that corresponds to the target region and the particular event is looked up. If it is greater than a preset threshold, the target region is judged likely to experience the particular event; otherwise, the target region is judged unlikely to experience it.
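The truncated-SVD step can be sketched with NumPy as follows. The matrix values, the choice K = 2, and the threshold 0.5 are hypothetical; only the decomposition, truncation to the K largest singular values, and reconstruction follow the text.

```python
import numpy as np

def low_rank_approximation(A, k):
    """Return B, the reconstruction of A from its k largest singular values."""
    A = np.asarray(A, dtype=float)
    # Reduced SVD; NumPy returns the singular values in descending order.
    T, s, D = np.linalg.svd(A, full_matrices=False)
    return T[:, :k] @ np.diag(s[:k]) @ D[:k, :]

def may_occur(B, region_idx, event_idx, threshold):
    """True if the reconstructed score for (region, event) exceeds the threshold."""
    return bool(B[region_idx, event_idx] > threshold)

# Hypothetical 4 regions x 3 events occurrence counts; row 2 is the target
# region, which has not yet experienced event 1.
A = np.array([[2, 0, 1],
              [3, 1, 0],
              [1, 0, 1],
              [1, 2, 0]], dtype=float)
B = low_rank_approximation(A, k=2)
print(np.round(B, 2))
print(may_occur(B, region_idx=2, event_idx=1, threshold=0.5))
```

Truncation smooths the matrix, so an entry that is zero in A can receive a non-zero score in B when regions with similar histories have experienced the event.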
If the above judgment determines that the particular event may occur in the target region, the value of each independent variable is formed from the regions where the event has currently occurred (1 for occurred, 0 for not occurred) and substituted into the regression prediction models above. The models are evaluated in the order C5, C4, C3, C2, C1. Specifically, if the result of C5 is true (the event will occur), C4 is evaluated; if the result is false (the event will not occur within 1 month, i.e. it may occur after 1 month), evaluation stops. This continues until a result is false or all models have been evaluated, which yields the time at which the event may occur in the target region: the time range represented by the last regression prediction model whose result was true. For example, if C2 is the last model whose result is true, the particular event can be predicted to occur in the target region after 1 day and within 3 days. In this way, early warnings of different time levels can be issued for the risk that the particular event will occur in the target region.
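The cascaded judgment can be sketched as below. Each model is represented by a callable returning True or False; how the models are fitted is outside the sketch, and the label strings and function name are illustrative assumptions.

```python
RANGE_LABELS = [
    "within 1 day",                    # C1
    "after 1 day, within 3 days",      # C2
    "after 3 days, within 1 week",     # C3
    "after 1 week, within 2 weeks",    # C4
    "after 2 weeks, within 1 month",   # C5
]

def predict_time_range(models, x, labels=RANGE_LABELS):
    """Evaluate the models from the widest range (C5) to the narrowest (C1).

    models: callables ordered C1..C5; models[i](x) is True if the event is
            predicted to occur in the target region within range i.
    Returns the label of the last model that returned True, or None if C5 is
    already False (the event is not expected within the widest range).
    """
    last_true = None
    for model, label in zip(reversed(models), reversed(labels)):
        if not model(x):
            break
        last_true = label
    return last_true

def stub(result):
    """Stand-in for a fitted model with a fixed answer."""
    return lambda x: result

# C5..C2 say "will occur", C1 says "will not": C2 is the last true model.
models = [stub(False), stub(True), stub(True), stub(True), stub(True)]
print(predict_time_range(models, x=[1, 0, 1]))  # after 1 day, within 3 days
```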
6. Result presentation and services
On the basis of the prediction and early-warning analysis of whether and when the particular event will occur in the target region, the results of the analysis are presented to users as tables, charts, and the like, and immediate delivery services such as SMS and e-mail are provided.
This completes the whole process of extracting food safety event information from crawled internet information, issuing timely early warnings according to event evolution and the event risk of the target region, and serving users. Along the way, techniques such as spam filtering, region information discovery, object category information discovery, trend tracking and early warning, and risk prediction and early warning ensure the accuracy of event information discovery and early warning and of risk prediction and early warning. This provides an important information basis for the risk early warning and rapid emergency handling of food safety events.
It is worth noting that the invention is not limited to the emergency management of food safety events. With slight modification, it can be applied to risk early warning and other emergency-handling work for other unconventional emergencies whose event information can be obtained from the internet.

Claims (10)

1. An event occurrence risk prediction and early-warning method based on open internet information, comprising the steps of:
1) building a food safety event information ontology, and building an additional table for each instance in the ontology;
2) performing spam filtering on crawled web page information to obtain non-spam web page information;
3) parsing the words that denote places in the filtered web page information to obtain precise place names; processing the parsed web page information by pattern matching against the names and attributes of the ontology instances in the region dimension of the food safety event information ontology, and assigning the web page information to the region that it successfully matches;
4) filtering the web page information to obtain web page information related to food safety; then, for each set object category, processing the filtered web page information with a regression analysis model to determine the object category to which each piece of web page information relates;
5) according to the region and the related object category determined for the web page information in steps 3) and 4), obtaining the set of web page information for an event of a set region and object, establishing the characteristic parameters of the event and periodically calculating their values, and issuing an early warning for an event if its characteristic parameter value exceeds a set threshold for a set duration;
6) if an early warning for a set object event occurs in a region, periodically calculating, based on matrix analysis and regression prediction models, the possibility that the set object event will occur in a target region and its possible time of occurrence, and issuing risk warnings of different levels;
wherein the words denoting places in the web page information are parsed by:
A) for a place-name pronoun, using a judgment model to determine whether a coreference relation exists between the place-name pronoun and a place name appearing before it, and if so, replacing the place-name pronoun with the corresponding place name;
B) parsing non-standard place names in the words based on a lookup table of standard and non-standard terms, and replacing each non-standard term with the standard term;
C) parsing the relative-position region information in the words based on the region dimension of the food safety event information ontology to obtain precise place names;
wherein the judgment model is built by: forming a sample set from web page information containing place-name pronouns, and labeling the coreference relation between each place-name pronoun in the sample set and the place names preceding it, as the class variable; establishing a feature vector for the relation between a place-name pronoun and a place name preceding it; and then selecting a machine learning method to build, from the sample set, the class variable, and the feature vector, the judgment model of whether a coreference relation exists between a place name and a place-name pronoun;
wherein whether a coreference relation exists between a place-name pronoun and a place name appearing before it is determined by: calculating the value of the feature vector of the relation between the place-name pronoun and the place name, and applying the judgment model to that value to determine whether the coreference relation exists.
2. The method of claim 1, wherein deceptive spam opinions in the crawled web page information are filtered by:
21) crawling web pages from selected user-generated-content sources, and building a user opinion information set from the crawled pages; clustering the user opinion information set to obtain several information regions, and calculating the mean of the feature vectors of all information in each information region as the concept vector of that information region;
22) sampling the user opinion information in each information region to obtain the sample set of each information region;
23) labeling the samples in the sample set of each information region to obtain, for each information region, a deceptive spam opinion sample set and an unlabeled opinion information sample set;
24) for each sample, finding the P samples most similar to it in the sample sets of the information regions, and obtaining the final feature vector of the sample from the class labels of those P samples and their similarity values with the sample;
25) based on the final feature vector of each sample, selecting a machine learning method to build a deceptive spam opinion detection model for each information region;
26) filtering the information in the user opinion information set with the deceptive spam opinion detection models.
3. The method of claim 2, wherein the sample set of each information region is obtained by: first labeling the information in the user opinion information set that has been determined to be deceptive spam opinion, and building a set of confirmed deceptive spam opinion information; then, after the opinion information has been partitioned into regions, drawing random samples from each region multiple times during sampling, and, using the confirmed deceptive spam opinion information set, selecting the draw that contains the most deceptive spam opinions as the final sample of that region, thereby obtaining the sample set of each information region.
4. The method of claim 2 or 3, wherein, for each sample, an initial feature vector is formed from the content and link dimension feature parameters of the sample and is used to find the P samples most similar to it in the sample sets of the information regions.
5. The method of claim 2, wherein, when the information in the user opinion information set is filtered with the deceptive spam opinion detection models, weight coefficients are established based on the distance between a piece of opinion information and each information region, the detection results of the individual deceptive spam opinion detection models for the user opinion information are combined by weighting to obtain a final detection result, and the user opinion information is labeled according to the final detection result.
6. The method of claim 2, wherein the final feature vector of a sample is calculated by:
A) segmenting the content of each extracted sample opinion into words, removing stop words, and forming a content feature vector Qj after dimensionality reduction, where j is the sample index;
B) calculating the link features of the sample opinion information, and weighting the individual link features to obtain a total value, denoted Lj;
C) calculating Mj = Lj*Qj to obtain the initial feature vector Mj, which characterizes the sample opinion information by content and links;
D) for each sample Sample in an information region, calculating, from the initial feature vector of the sample, its similarity value with each sample in every information region, and sorting those samples by similarity value in descending order to obtain its similar-sample sequence;
E) multiplying the class label of each of the first P samples in the sequence by the corresponding similarity value to form a P-dimensional vector N, which is the final feature vector of Sample.
7. The method of claim 2, wherein the confirmed deceptive spam opinion information set and the unlabeled user opinion information set are periodically supplemented and updated; the updated user opinion information set is then clustered, the distance between the current concept vector of each information region and its previous concept vector is calculated, and the distances are summed to obtain an accumulated value Dis; when Dis exceeds a preset threshold, the deceptive spam opinion detection model of each information region is updated.
8. The method of claim 2, wherein the feature vector used for cluster analysis of the user opinion information set is obtained by: extracting from the opinion information the character count, word count, number of paragraphs, mean paragraph length, number of sentences, mean sentence length, number of first-person pronouns, number of second-person pronouns, number of third-person pronouns, number of adjectives, number of adverbs, number of verbs, number of person names, number of place names, number of organization names, number of time expressions, number of exclamation marks, number of question marks, and title character count, and normalizing them.
9. The method of claim 1, wherein a place-name pronoun denoting a place in the web page information is parsed by:
91) setting up a sliding window of length L for pronoun parsing;
92) checking whether a place name exists within the L words before the place-name pronoun; if so, applying the judgment model, and if a coreference relation exists, determining the place name corresponding to the pronoun from the coreference relation and ending the parsing; otherwise proceeding to step 93);
93) checking whether a place name exists within the 2L words before the place-name pronoun; if so, applying the judgment model, and if a coreference relation exists, determining the place name corresponding to the pronoun from the coreference relation and ending the parsing; otherwise proceeding to step 94);
94) determining the place name to which the place-name pronoun refers from the information source or website location obtained during metadata extraction, by extraction or substitution.
10. The method of claim 1, 2, or 9, wherein the possibility that the set event will occur in the target region and its possible time of occurrence are calculated, and risk warnings of different levels are issued, by:
11) selecting the historical event information set of the regions at the same administrative level as the target region, and building an event network graph from the historical event information set; wherein the vertices of the event network graph represent the regions and the food safety events, and if an event occurred in a region, an edge is created between the vertex representing that region and the vertex representing that event, the weight of the edge being the number of times the event occurred;
12) converting the event network graph into an R*S matrix A, where R is the number of regions and S is the number of food safety events;
13) based on the historical event information set, setting N time ranges according to the interval between the occurrence of the set event in the target region and the earliest occurrence of that event, and labeling the historical event information set for each time range to form N data sets;
14) for each data set, taking whether the set event occurred in the target region within the corresponding time range as the dependent variable and whether the corresponding event occurred in each of the remaining regions as the independent variables, and building a regression prediction model between the independent and dependent variables by regression analysis;
15) updating the corresponding elements of matrix A, and processing matrix A by matrix decomposition to form a new matrix B;
16) finding the element of matrix B that corresponds to the target region and the set event; if it is greater than a preset threshold, determining that the set event may occur in the target region; otherwise, determining that the set event will not occur;
17) if it is determined that the set event may occur in the target region, obtaining the values of the independent variables from the regions where the set event has currently occurred, substituting them into the regression prediction models, and obtaining from the results the predicted time at which the set event may occur in the target region;
18) according to the risk prediction result, issuing early warnings of different levels for the risk that the set event will occur in the target region.
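Steps D) and E) of claim 6, which turn similarity to labeled neighbors into a sample's final feature vector, can be illustrated with the sketch below. Cosine similarity, the +1/-1 label encoding, and the exclusion of the sample itself from its neighbor list are assumptions made for the example; the claims do not fix these details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def final_feature_vector(idx, vectors, labels, p):
    """Similarity-weighted labels of the p samples most similar to sample idx.

    vectors: initial feature vectors of all samples
    labels:  class label per sample (+1 deceptive spam, -1 otherwise; assumed encoding)
    """
    sims = [(cosine(vectors[idx], v), labels[j])
            for j, v in enumerate(vectors) if j != idx]
    sims.sort(key=lambda t: t[0], reverse=True)   # most similar first
    return [s * lab for s, lab in sims[:p]]

vectors = [[1, 0], [1, 0], [0, 1], [1, 1]]   # hypothetical initial vectors
labels = [1, 1, -1, -1]
print(final_feature_vector(0, vectors, labels, p=2))
```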
CN201210501872.7A 2012-11-29 2012-11-29 A kind of prediction of event occurrence risk method for early warning based on internet opening imformation Active CN103854063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210501872.7A CN103854063B (en) 2012-11-29 2012-11-29 A kind of prediction of event occurrence risk method for early warning based on internet opening imformation

Publications (2)

Publication Number Publication Date
CN103854063A CN103854063A (en) 2014-06-11
CN103854063B true CN103854063B (en) 2017-04-05

Family

ID=50861693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210501872.7A Active CN103854063B (en) 2012-11-29 2012-11-29 A kind of prediction of event occurrence risk method for early warning based on internet opening imformation

Country Status (1)

Country Link
CN (1) CN103854063B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123368B (en) * 2014-07-24 2017-06-13 中国软件与技术服务股份有限公司 The method for early warning and system of big data Importance of Attributes and identification based on cluster
CN104156402B (en) * 2014-07-24 2017-06-13 中国软件与技术服务股份有限公司 A kind of normal mode extracting method and system based on cluster
CN104142986B (en) * 2014-07-24 2017-08-04 中国软件与技术服务股份有限公司 A kind of big data Study on Trend method for early warning and system based on cluster
CN106548189B (en) * 2015-09-18 2019-06-21 阿里巴巴集团控股有限公司 A kind of event recognition method and equipment
CN107025596B (en) * 2016-02-01 2021-07-16 腾讯科技(深圳)有限公司 Risk assessment method and system
CN107247742A (en) * 2017-05-17 2017-10-13 武汉工程大学 A kind of text message abstracting method based on web page characteristics
CN110334720A (en) * 2018-03-30 2019-10-15 百度在线网络技术(北京)有限公司 Feature extracting method, device, server and the storage medium of business datum
CN110086829B (en) * 2019-05-14 2021-06-22 四川长虹电器股份有限公司 Method for detecting abnormal behaviors of Internet of things based on machine learning technology
CN110457595B (en) * 2019-08-01 2023-07-04 腾讯科技(深圳)有限公司 Emergency alarm method, device, system, electronic equipment and storage medium
CN113051573B (en) * 2021-02-19 2021-11-02 广州银汉科技有限公司 Host safety real-time monitoring alarm system based on big data
CN113051315B (en) * 2021-03-26 2022-08-19 中国气象局公共气象服务中心(国家预警信息发布中心) Information quantity calculation system for emergency early warning information
CN114565196B (en) * 2022-04-28 2022-07-29 北京零点远景网络科技有限公司 Multi-event trend prejudging method, device, equipment and medium based on government affair hotline
CN117131944B (en) * 2023-10-24 2024-01-12 中国电子科技集团公司第十研究所 Multi-field-oriented interactive crisis event dynamic early warning method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488150A (en) * 2009-03-04 2009-07-22 哈尔滨工程大学 Real-time multi-view network focus event analysis apparatus and analysis method
JP2010128806A (en) * 2008-11-27 2010-06-10 Hitachi Ltd Information analyzing device
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
CN102193951A (en) * 2010-03-19 2011-09-21 华为技术有限公司 Information extracting method and system
CN102567393A (en) * 2010-12-21 2012-07-11 北大方正集团有限公司 Method, device and system for processing public sentiment topics
CN102708096A (en) * 2012-05-29 2012-10-03 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070088794A1 (en) * 2005-09-27 2007-04-19 Cymer, Inc. Web-based method for information services

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
用户生成内容中垃圾意见研究综述;杨风雷 等;《计算机应用研究》;20111031;第28卷(第10期);全文 *

Also Published As

Publication number Publication date
CN103854063A (en) 2014-06-11

Similar Documents

Publication Publication Date Title
CN103854063B (en) A kind of prediction of event occurrence risk method for early warning based on internet opening imformation
CN103854064B (en) Event occurrence risk prediction and early warning method targeted to specific zone
CN103176981B (en) A kind of event information excavates and the method for early warning
Bozarth et al. Toward a better performance evaluation framework for fake news classification
CN103853700B (en) A kind of event method for early warning found based on region and object information
CN103853738B (en) A kind of recognition methods of info web correlation region
CN103853744B (en) Deceptive junk comment detection method oriented to user generated contents
CN105005594B (en) Abnormal microblog users recognition methods
CN105138570B (en) The doubtful crime degree calculation method of network speech data
Kalampokis et al. Combining social and government open data for participatory decision-making
CN110457404A (en) Social media account-classification method based on complex heterogeneous network
CN106940732A (en) A kind of doubtful waterborne troops towards microblogging finds method
CN103176984B (en) Duplicity rubbish suggestion detection method in a kind of user-generated content
CN101394311A (en) Network public opinion prediction method based on time sequence
CN102946331A (en) Detecting method and device for zombie users of social networks
Yamak et al. Detection of multiple identity manipulation in collaborative projects
Petroni et al. An extensible event extraction system with cross-media event resolution
CN107305545A (en) A kind of recognition methods of the network opinion leader based on text tendency analysis
Hofmann et al. The reddit politosphere: a large-scale text and network resource of online political discourse
Ruffo et al. Surveying the research on fake news in social media: a tale of networks and language
Cao et al. Fake reviewer group detection in online review systems
Sharma et al. Going beyond content richness: Verified information aware summarization of crisis-related microblogs
Abu Talha et al. Scrutinize artificial intelligence algorithms for Pakistani and Indian parody tweets detection
Mouty et al. Survey on steps of truth detection on Arabic tweets
Arafat et al. Popularity prediction of online news item based on social media response

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant