CN103854063B - A method for predicting and giving early warning of event occurrence risk based on open Internet information
- Publication number
- CN103854063B (application CN201210501872.7A)
- Authority
- CN
- China
- Prior art keywords
- information
- sample
- event
- pronoun
- info web
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for predicting and giving early warning of event occurrence risk based on open Internet information. The method comprises: 1) filtering spam from web page information; 2) parsing the words that denote places in the filtered web page information to obtain place names, then processing the parsed web page information against a pre-built information ontology and assigning each item to its matching region; 3) filtering the web page information to retain the items related to food safety, then processing the retained items with a regression analysis model to determine the object types each item relates to; 4) determining the set of web page information for a given region and target event, establishing event feature parameters, computing their values periodically, and issuing an early warning for an event when one of its feature parameter values exceeds a set threshold; 5) issuing graded early warnings of the risk that a set event occurs in a target region, based on matrix analysis and a regression prediction model. The invention improves the efficiency of risk early warning.
Description
Technical field
The invention belongs to the field of information technology. In particular, it relates to a method that applies specific processing to Internet information obtained by crawling and then predicts, and gives early warning of, the risk that a particular event will occur in a particular region. It is mainly used in emergency handling of unconventional incidents, such as food safety information monitoring and risk early warning.
Background art
In recent years, food safety incidents such as toxic capsules, recycled cooking oil, clenbuterol hydrochloride, dyed steamed buns, plasticizers and poisonous cucumbers have occurred repeatedly. These incidents have had a severe social impact and caused substantial economic losses. To avoid, or at least minimize, the harm they cause, event-based risk early warning technology has begun to attract considerable attention. Event-based risk early warning requires that information about these events be discovered in advance.
With the rapid development of the Internet, the number of Internet users keeps growing. The Internet has become the main medium through which users publish, obtain and transmit information, and the interactions among people and organizations on it have formed a virtual society that corresponds to, and is associated with, the real one. It has become the largest public data source in the world, and its scale continues to grow. Under these circumstances, it is both necessary and highly significant to exploit the characteristics of the Internet to build a sound social information feedback network, discover potential crisis factors before they materialize, and provide timely, accurate and comprehensive information for the emergency management of food safety incidents.
In practice, we have observed that before most food safety incidents occur, scattered clues have already appeared on the Internet. Relevant Internet information can therefore be collected and analyzed, and then delivered in the form of early warnings, to serve as a direct information source for the emergency management of these incidents. Obtaining the required target information from the Internet in a timely, accurate and comprehensive manner calls for Internet information analysis and early warning techniques.
For example, some research uses Internet information for risk early warning, but how the collected information is processed and which measures are taken still require human participation and decisions. Other research performs automated food safety risk early warning for additives, supplements and the like based on Internet information, but it has shortcomings. It does not consider information quality, so the collected spam is not filtered, which affects the accuracy of the warnings. During event information discovery, it directly treats the classified information obtained by keyword matching as information about the same event, so the subjects of different items may be inconsistent. Actual test results show that its information classification and the accuracy and comprehensiveness of its warnings still need improvement.
Furthermore, once event information discovery has extracted the events that have occurred in the relevant regions, it would be highly valuable for regional risk monitoring and early warning to predict the risk that a particular event will occur in a particular region where it has not yet occurred, that is, whether such an event is likely to occur there and after how long. A literature review found no such research to date.
Summary of the invention
To solve the above problems, the object of the invention is to provide a method that analyzes the content of web page information through specific steps and then predicts, and gives early warning of, the risk that a particular event will occur in a particular region. The method draws on the intelligent systems approach; its steps are described below.
1. Web page crawling
Internet crawler software (such as Heritrix or Nutch) is used to crawl the web page information at the information sources. During crawling, techniques such as scope-limited crawling and vertical crawling are used to obtain as much of the required web page information as possible, which is then saved.
2. Spam filtering
To improve information quality in subsequent processing, spam is filtered from the crawled web page information. The filtering mainly targets spam pages that cheat through content or links and, within user-generated content, irrelevant opinions, low-quality opinions and deceptive spam opinions, all of which are filtered by purpose-built detection models. This ensures the quality of the information that enters subsequent steps.
3. Region information discovery
After spam filtering, the title, body and other parts of each crawled web page are parsed for place names, place pronouns and similar words. Pattern matching and a machine-learning judgment model are then used to identify and determine the region to which the information relates.
4. Regional event early warning
After the information has undergone food safety filtering and object information discovery, feature parameters that characterize a regional event are established, such as page count, page view count and a composite index. The values of these parameters are computed periodically to track how the event develops. Each current feature parameter value is compared with its mean over a preceding period; if the difference is positive and its absolute value persistently exceeds a certain threshold, a regional event early warning is issued.
5. Prediction and early warning of event occurrence risk in the target region
Based on the regional distribution of particular events that have already occurred, matrix decomposition and logistic regression analysis are used to analyze and predict whether a particular event will occur in the target region and when it is likely to occur, and graded risk warnings are issued according to the prediction.
6. Result display and service
Once the prediction and early warning analysis of whether and when a particular event will occur in the target region is complete, the results are presented to users as tables, charts and the like. Push-style delivery services such as SMS and e-mail are also provided.
To improve the accuracy of event information discovery, the invention first applies spam filtering to the crawled Internet information before any subsequent processing.
To ensure that the samples used to build the deceptive spam opinion detection model are representative, the invention first builds, for each opinion, a partitioning feature vector based on content distribution and clusters the opinions into partitions. Samples for model building are then drawn from each partition by random sampling, which ensures their representativeness.
To build the deceptive spam opinion detection model, the invention extracts sample features as follows. First, an initial feature vector based on content and links is built for each sample. Then, for a given sample, the P most similar samples are found, and the sample's final feature vector is derived from the class labels of these P samples and their similarity values to the sample. This is repeated until every sample has a final feature vector. Because the feature vector combines content, links and the classes of similar samples, feature extraction is comprehensive and complete.
When using the models to detect deceptive spam opinions, the invention derives weight coefficients from the distance between the opinion and each partition, combines the results that the per-partition detection models give for the opinion, and obtains the final result as their weighted sum. This ensures the accuracy of detection.
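The weighted combination described above can be sketched as follows. This is only an illustration under stated assumptions: the text says the weights are derived from the distance between the opinion and each partition but does not give the formula, so the inverse-distance weighting, the Euclidean distance and the names `combined_score`, `centers` and `scores` are all hypothetical choices.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_score(
    partition_features: Sequence[float],
    centers: Sequence[Sequence[float]],
    scores: Sequence[float],
    eps: float = 1e-9,
) -> float:
    """Combine per-partition detector outputs into one score.

    partition_features: the opinion's partitioning feature vector.
    centers: the center vector of each partition.
    scores: the output of each partition's detection model for this opinion.
    Assumption: weights are inverse distances normalized to sum to 1, so
    nearer partitions count more. The patent text does not fix this formula.
    """
    raw: List[float] = [1.0 / (euclidean(partition_features, c) + eps) for c in centers]
    total = sum(raw)
    return sum((r / total) * s for r, s in zip(raw, scores))
```

With two partitions at distances 1 and 9 from the opinion, the nearer partition's detector receives about 90% of the weight.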
To improve the accuracy of identifying the region to which web page information relates, the invention first preprocesses the web page information, then resolves words that may be place names into unambiguous terms, and finally uses pattern matching, model-based judgment and similar means to decide whether the information can be assigned to the target region, thereby determining the relevant region.
In determining the relevant region, the invention applies, in order, pattern matching on the title, pattern matching on the body text, and a machine-learning judgment model. In the model-based step, an integrated region judgment model is used, which avoids inaccurate region judgments caused by identical place names and by words with multiple senses (for example, a common word that is also used as a place name).
In object information discovery, the invention uses pre-built regression analysis models: the title, body and other parts of each item are segmented into words and reduced in dimension, and regression analysis is then performed for each object type to determine which object types the web page information relates to.
The invention periodically computes the relationship between each current feature parameter value of an event and its mean over a preceding period. When the difference is positive and its absolute value persistently reaches a certain level (for example, 3 standard deviations), an early warning for the event is issued promptly.
For an event under warning, the invention periodically computes each feature parameter value and compares the current value with its mean over the preceding period (counted from the warning date). If the difference is negative and its absolute value exceeds a certain threshold, the warning for the event is terminated.
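The two threshold rules above (starting and ending a warning) might look like this for a single feature parameter. It is a minimal sketch, assuming that "persistently" means every value in a short run of recent periods exceeds the baseline; the function names, the use of the population standard deviation and the `threshold` parameter are illustrative assumptions, not details from the text.

```python
from statistics import mean, pstdev
from typing import Sequence


def should_warn(history: Sequence[float], recent: Sequence[float], k: float = 3.0) -> bool:
    """Start a warning if every recent value exceeds the baseline mean by more than k std devs.

    history: feature values over the preceding baseline period.
    recent: values from the latest consecutive periods (assumed reading of "persistently").
    """
    mu = mean(history)
    sigma = pstdev(history)
    return len(recent) > 0 and all(v - mu > k * sigma for v in recent)


def should_end_warning(since_warning: Sequence[float], current: float, threshold: float) -> bool:
    """End a warning if the current value is below the mean since the warning date by more than threshold."""
    return mean(since_warning) - current > threshold
```

For example, with a baseline page count hovering around 10 and recent counts of 30 and 35, `should_warn` returns True; once the count falls far enough below the mean observed since the warning date, `should_end_warning` returns True.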
Based on the regional distribution of particular events that have already occurred, the invention uses matrix decomposition and logistic regression analysis to analyze and predict whether a particular event will occur in the target region and when it is likely to occur, and issues graded risk warnings according to the prediction.
Advantages of the invention over the prior art:
By applying spam filtering, region information discovery, object information discovery, trend tracking and early warning of regional events, and risk prediction and early warning to crawled Internet information, the invention ensures the accuracy and comprehensiveness of food safety event discovery and warning and of event occurrence risk prediction and warning for target regions, and thereby ensures the efficiency of food safety risk early warning.
Brief description of the drawings
Fig. 1 is a flow chart of the method for predicting and giving early warning of event occurrence risk based on open Internet information;
Fig. 2 is a schematic diagram of the deceptive spam opinion detection method;
Fig. 3 is a flow chart of the method for identifying the region to which web page information relates;
Fig. 4 is a schematic diagram of the regional event early warning method;
Fig. 5 is a schematic diagram of the method for predicting and giving early warning of event risk in the target region.
Detailed description of the embodiments
An embodiment of the invention is shown in Fig. 1; the specific steps are described below.
1. Web page crawling
Internet crawler software (such as Heritrix or Nutch) is used to crawl the web page information at the information sources. During crawling, techniques such as scope-limited crawling and vertical crawling are used to obtain as much of the required web page information as possible, which is then saved.
2. Spam filtering
As the Internet develops, the number of web pages and the volume of content keep growing, and so does the amount of spam in web pages. Spam filtering is therefore necessary to keep subsequent processing accurate. The spam filtering stage has two parts: filtering spam web pages and filtering spam opinions in user-generated content. Spam web pages can be divided into content-cheating pages and link-cheating pages. Spam opinions can be divided, by the severity of their negative effect, into untrustworthy opinions, low-quality opinions and irrelevant opinions. Untrustworthy opinions are deceptive opinions: they either give a specific object, event or person excessively high ratings or praise that does not match the facts, or give excessively low ratings, abuse or attacks that do not match the facts. Low-quality opinions are generally short. Their content may or may not be useful, but because it does not describe the specific topic or product in detail, its value for opinion mining on that topic or product cannot be established, so it is treated as a kind of spam opinion (from the computer's point of view). Irrelevant opinions are mainly advertisements or content unrelated to the topic.
For spam web pages on a website, and for low-quality and irrelevant opinions in user-generated content, the spam characteristics are relatively obvious. Detection models can therefore be built from a pre-built, labeled sample set by extracting features along three dimensions: content, content distribution and links. Before feature extraction, the web page information undergoes metadata extraction, body text extraction, word segmentation, sentence statistics, paragraph statistics, anchor text statistics, link statistics and similar processing.
Content features: the extracted information is segmented into words, stop words are removed, and after dimension reduction (for example by document frequency or information gain) a content feature vector is formed, with term frequency as the weight.
Content distribution features: title length (characters), paragraph count, sentence count, paragraph length (mean), sentence length (mean), information length (characters), anchor text count, anchor text length (characters, mean) and so on. When building the model, each feature is normalized as y = x/(max + 1), where x and y are the feature values before and after normalization and max is the maximum of that feature obtained in advance from sample statistics over the website's information set. If x > max occurs before max has been updated, x = max + 1 is used, so y = 1.
Link features: the ratio of the item's outgoing links within its own site to its total outgoing links; the ratio of its outgoing links to other sites to its total outgoing links; the ratio of its links that point into the pre-built spam page set to its total outgoing links; the ratio of pages in the pre-built spam page set that link to the item to the total number of pages; and so on.
For each of the three dimensions, feature vectors are formed from the pre-built spam and non-spam sets, and a machine learning method (such as a support vector machine) is used to build a spam detection model. This gives three models, which are periodically rebuilt from the updated sample set. Newly collected information can then be filtered: an item is judged to be spam if at least two of the three models return a positive result.
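The normalization rule y = x/(max + 1) and the two-out-of-three decision rule can be illustrated with a short sketch. The function names `normalize` and `is_spam` are hypothetical; the three detectors themselves (content, content distribution, link) are assumed to exist elsewhere and to return a boolean each.

```python
from typing import Sequence


def normalize(x: float, max_seen: float) -> float:
    """Scale a raw feature value with y = x / (max + 1).

    If x exceeds the stored maximum (the maximum has not been refreshed yet),
    x is replaced by max + 1, so the result is exactly 1.
    """
    if x > max_seen:
        x = max_seen + 1
    return x / (max_seen + 1)


def is_spam(votes: Sequence[bool]) -> bool:
    """Judge an item as spam when at least two of the three detectors say so."""
    return sum(votes) >= 2
```

Dividing by max + 1 rather than max keeps in-range values strictly below 1 and avoids division by zero when the stored maximum is 0.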
To address the identification of deceptive spam opinions, the method also draws on intelligent systems thinking. The resulting identification steps are shown in Fig. 2 and described in detail below.
(1) Producing the opinion set
The information crawled by the Internet crawler software from a particular user-generated content source is preprocessed (including extraction of metadata such as the author, body text extraction, word segmentation, part-of-speech tagging, named entity extraction, sentence statistics, paragraph statistics and punctuation statistics) to form the user opinion set.
(2) Labeling deceptive spam opinions
The purpose of a deceptive spam opinion is to raise or lower, contrary to the facts, the image of a particular object such as a website, web page, product or person. It either gives a specific object, event or person excessively high ratings or praise that does not match the facts, or gives excessively low ratings, abuse or attacks that do not match the facts. On this basis, and given certain distribution characteristics that deceptive spam opinions show in practice, heuristics are used to collect the user-generated content that may be deceptive spam. Specifically, this step focuses on: opinions whose content is duplicated or nearly duplicated; opinions published by the top-N1 authors by opinion volume within a given period; opinions about the top-N2 objects by opinion volume within a given period; opinions from the top-N3 IP addresses by opinion volume within a given period; opinions published by the top-N4 earliest users to comment on a particular object; and opinions published by the top-N5 users who revised their opinions on a particular object most often.
Following these rules, the opinions in the user opinion set that meet the above conditions are collected to form the candidate deceptive spam opinion set. The candidates are then reviewed and confirmed through auditing, investigation and similar means, on the principle that it is better to miss some than to include doubtful ones (to keep the deceptive spam samples accurate). Two confirmation methods are used: forward confirmation and reverse confirmation.
Forward confirmation: if the content of an opinion describes the same matter as an entry in the deceptive spam opinion knowledge base, that is, the content matches the description of some entry, the opinion is a deceptive spam opinion. An entry is added to the knowledge base when, after some time has passed or after subsequent proof, an opinion published by some user has been shown to be deceptive. For example, someone posts on a forum that a certain brand of milk contains melamine; someone else later lists various reasons to show that this is impossible; and it is subsequently proved that the latter post was a deception by an insider of that milk company. That opinion can then be confirmed as deceptive spam and added to the knowledge base (which is built in advance and updated regularly).
Reverse confirmation: under normal circumstances such information could not occur, so the opinion is shown to be deceptive spam from the opposite direction. For example, one rule in the reverse confirmation knowledge base (built in advance and updated regularly) is: if a user id publishes more than N opinions (for example, 10) on one or more products within a set time (for example, 1 minute), the opinions published by that user are labeled as deceptive spam opinions. An example that matches this rule: on a forum, a user id published 15 reviews of 3 different products in less than 1 minute, which is impossible for a normal person. The deceptiveness of the opinions published by this user is thus demonstrated from the opposite direction.
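The example reverse-confirmation rule (more than N opinions from one user id within a set time) can be checked with a sliding window over each user's timestamps. This is a sketch only; the function name `flag_burst_users`, the `(user_id, timestamp)` input shape and the use of seconds are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def flag_burst_users(
    posts: Iterable[Tuple[str, float]],
    window_seconds: float = 60.0,
    max_posts: int = 10,
) -> Set[str]:
    """Return the user ids that published more than max_posts opinions within any window.

    posts: (user_id, timestamp_in_seconds) pairs in any order.
    """
    by_user: Dict[str, List[float]] = defaultdict(list)
    for user_id, ts in posts:
        by_user[user_id].append(ts)

    flagged: Set[str] = set()
    for user_id, times in by_user.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Shrink the window until it spans at most window_seconds.
            while t - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_posts:
                flagged.add(user_id)
                break
    return flagged
```

All opinions from a flagged user would then be labeled as deceptive spam, as the rule states; a user who posts the same number of opinions spread over several minutes is not flagged.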
The opinions confirmed by the above methods are labeled to form the accurate deceptive spam opinion set. Users who frequently publish deceptive spam opinions, that is, the N users who published the most, are added to a blacklist for later identification. In addition, the abnormal behaviors of opinion authors (such as the above user publishing 15 opinions on 3 products within 1 minute) are summarized from the accurate deceptive spam opinion set and turned into rules for later use.
Note that confirming that an opinion is not deceptive spam is also quite difficult: when an opinion cannot be clearly shown to be deceptive spam, it may equally be impossible to show clearly that it is not. Given the time and workload involved and the diversity of non-deceptive opinions, non-deceptive spam opinions are not labeled here.
(3) Opinion partitioning
For each item in the user opinion set formed and labeled in steps (1) and (2), a partitioning feature vector is extracted. The items are then clustered into several information regions, and the center vector of each region is computed. The computation is described below.
The partitioning feature vector of each opinion is extracted as follows. The features are the opinion's character count, word count, paragraph count, paragraph length (mean), sentence count, sentence length (mean), first-person pronoun count, second-person pronoun count, third-person pronoun count, adjective count, adverb count, verb count, person name count, place name count, organization name count, time expression count, exclamation mark count, question mark count, title character count and so on. Each feature is normalized as y = x/(max + 1), where x and y are the feature values before and after normalization and max is the maximum of that feature obtained in advance from statistics over the user opinion set. If x > max occurs before max has been updated, x = max + 1 is used, so y = 1. The result is the normalized partitioning feature vector.
The opinions are then clustered; hierarchical clustering, non-hierarchical clustering or similar methods can be used.
Through this process, the original user opinion set is divided into several sub-regions (partitions) according to the partitioning feature vectors. The center vector Mark_i of each partition (i is the partition number) is computed as the mean of the feature vectors of all opinions in that partition.
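Computing a partition's center vector as the element-wise mean of its members' feature vectors can be sketched as below. The clustering itself is not shown, and the function name `center_vector` is hypothetical.

```python
from typing import List, Sequence


def center_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the normalized partitioning feature vectors in one partition."""
    n = len(vectors)
    if n == 0:
        raise ValueError("a partition must contain at least one opinion")
    # zip(*vectors) iterates over feature columns.
    return [sum(column) / n for column in zip(*vectors)]
```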
(4) Opinion sampling
Samples are drawn from each user opinion partition formed in step (3) (the sample size is determined in advance). Random sampling is used, as follows:
Let S be the number of samples to draw (determined in advance) and I_i the number of opinions in partition i. The number of samples to draw from partition i is S_i = S*I_i/∑I_i. This is a nominal value: the per-partition counts may be adjusted moderately, provided each stays above a preset threshold and S = ∑S_i holds.
The opinions in each partition are numbered consecutively starting from 1 until every opinion has a number; let the largest number be MAX_i. A random function is then used to generate S_i random numbers between 1 and MAX_i, and the opinions with those numbers form the sample drawn from the partition.
During sampling, each partition is sampled 10 times according to the above rule, and the draw that contains the most deceptive spam opinions is kept as the final sample, so that as many deceptive spam opinions as possible are included.
This yields the sample set of each opinion partition.
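The proportional allocation S_i = S*I_i/∑I_i and the repeated-draw selection can be sketched as follows. Several details are assumptions rather than statements from the text: fractional counts are resolved with a largest-remainder rule so that the counts sum to S, the per-partition minimum threshold is not enforced, and the random numbers are drawn without repetition. The function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def allocate(total_samples: int, partition_sizes: Sequence[int]) -> List[int]:
    """Split total_samples across partitions in proportion to their sizes.

    Counts are rounded down, then the leftover samples go to the partitions
    with the largest fractional parts, so the counts always sum to total_samples.
    """
    total = sum(partition_sizes)
    exact = [total_samples * n / total for n in partition_sizes]
    counts = [int(e) for e in exact]
    leftover = total_samples - sum(counts)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def best_of_draws(
    is_deceptive: Sequence[bool],
    k: int,
    draws: int = 10,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Draw k opinion numbers (1-based) `draws` times and keep the draw with the most deceptive opinions.

    is_deceptive[n - 1] tells whether opinion number n was labeled deceptive in step (2).
    """
    rng = rng or random.Random()
    best: List[int] = []
    best_hits = -1
    for _ in range(draws):
        picked = rng.sample(range(1, len(is_deceptive) + 1), k)
        hits = sum(is_deceptive[n - 1] for n in picked)
        if hits > best_hits:
            best, best_hits = picked, hits
    return best
```

Keeping the draw with the most labeled deceptive opinions biases the sample toward the positive class on purpose, which matches the stated goal of including as many deceptive spam opinions as possible.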
(5) Second labeling of opinion samples
The samples drawn from each partition are sorted out and labeled a second time as either deceptive spam opinions or other opinions, so that the sample of each partition consists of a deceptive spam opinion set and an unlabeled opinion set.
(6) Sample feature extraction
Feature extraction, algorithm selection and so on, applied to the second-labeled samples of each partition, are necessary steps in building the detection models. Feature extraction is the critical step; in this method it proceeds as follows:
A) The content of each sampled opinion is first segmented into words, stop words are removed, and after dimension reduction (for example by document frequency or information gain) a content feature vector Q_j is formed (the weights are term frequencies; j is the sample number).
B) The link features of the sampled opinion are then computed. They include the ratio of the opinion's outgoing links within its own site to its total outgoing links, the ratio of its outgoing links to other sites to its total outgoing links, the ratio of its links that point to items in the accurate deceptive spam opinion set to its total outgoing links, the ratio of items in the accurate deceptive spam opinion set that link to the opinion to the total number of pages, and so on. The parameters are combined as a weighted sum (the weights are determined in advance by statistical analysis and must sum to 1) to give a total value L_j.
C) Finally, M_j = L_j*Q_j is computed, giving the initial feature vector M_j that characterizes the sampled opinion in terms of content and links.
D) For a given sample Sample in a partition, the similarity (cosine) between its initial feature vector and that of every sample in each partition is computed, and the samples are sorted by similarity in descending order to give the similar-sample sequence of Sample.
E) For each of the first P samples in the sequence (P is determined in advance by analysis), the second-labeling class label (1 for a deceptive spam sample, -1 for an unlabeled sample) is multiplied by that sample's similarity to Sample. The products form a P-dimensional vector N, which is the final feature vector of Sample.
Steps D) and E) are repeated until the feature vectors of all samples have been computed.
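Steps B) to E) can be sketched in a few functions. This is an illustration under assumptions: the sample itself is excluded from its own similar-sample sequence (the text does not say either way), L_j is treated as a scalar, and all function names are hypothetical. Word segmentation and dimension reduction from step A) are assumed to have produced the content vectors already.

```python
import math
from typing import List, Sequence


def link_score(ratios: Sequence[float], weights: Sequence[float]) -> float:
    """Step B: weighted sum of the link ratios (weights are expected to sum to 1)."""
    return sum(r * w for r, w in zip(ratios, weights))


def initial_vector(link: float, content: Sequence[float]) -> List[float]:
    """Step C: M_j = L_j * Q_j, scaling the content vector by the link score."""
    return [link * q for q in content]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def final_vector(
    index: int,
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    p: int,
) -> List[float]:
    """Steps D and E: label * similarity for the p samples most similar to sample `index`.

    labels[j] is 1 for a deceptive spam sample and -1 for an unlabeled sample.
    """
    sims = [
        (cosine(vectors[index], v), labels[j])
        for j, v in enumerate(vectors)
        if j != index  # assumption: a sample is not its own neighbor
    ]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    return [sim * label for sim, label in sims[:p]]
```

One property worth keeping in mind when reading step C): cosine similarity does not change when a vector is multiplied by a positive scalar, so if L_j is a positive scalar it does not affect the ranking in step D). The sketch follows the text as written.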
(7) Building the deceptive spam opinion detection model
Once every sampled opinion has a feature vector, a machine learning method must be selected to build the detection model for identifying deceptive spam opinions. Note that the sample set obtained above (step (5)) contains labeled deceptive spam opinions and unlabeled opinions, but no labeled non-deceptive opinions. An ordinary supervised learning method therefore cannot simply be used, because it needs both positive and negative example sets to build a model. A machine learning method that learns from positive and unlabeled examples is used instead: the biased SVM (Liu, B., Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. Proceedings of IEEE International Conference on Data Mining, 2003.).
To each subregion, characteristic vector based on above-mentioned calculated sample and selected " from positive example and without mark
The machine learning method of data learning ", you can set up detection model (each subregion one of recognition detection duplicity rubbish suggestion
Individual model).
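For orientation, the biased SVM of the cited paper is commonly written as the soft-margin problem below, in which unlabeled samples are treated as negatives and errors on the two groups are penalized with separate costs. The symbols w, b, ξ_i, C_+ and C_- follow that formulation and are not defined in this document; how the costs would be chosen here is not stated.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\, w^{\top} w
  \;+\; C_{+} \sum_{i \in \text{positive}} \xi_i
  \;+\; C_{-} \sum_{i \in \text{unlabeled}} \xi_i \\
\text{s.t.}\quad & y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0,
\end{aligned}
```

Here x_i would be the final feature vector N of sample i, and y_i its label (1 for a labeled deceptive spam opinion, -1 for an unlabeled sample). Choosing C_+ larger than C_- reflects that the positive labels are reliable while the unlabeled set may still contain positives.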
(8) Detection and identification of deceptive spam opinions
Once the detection model of each partition has been built, the user-generated content newly crawled by the Internet information crawler can be judged for deceptive spam opinions. Overall, identification proceeds in three steps: blacklist identification, reverse identification, and model-based identification. Blacklist identification is performed first: information published by users on the blacklist is directly identified as deceptive spam opinion. The remaining opinions are checked by reverse confirmation against the rules summarized in step (2) (that is, showing that such information could not occur under normal circumstances, which proves from the opposite direction that it is a deceptive spam opinion); abnormal opinions are identified as deceptive spam opinions. The opinions still remaining are identified with the models built in step (7), as follows:
First, compute the feature vector used for partitioning the opinion information (according to the method of step (3)), and compute the distance di between the opinion information and each partition (the distance between this feature vector and the characterization vector of each partition; i is the partition index). From these distances, compute the weight of each partition's detection model for this opinion information: ei = di/∑di.
Then apply the detection model of each partition to the opinion information to obtain the detection result Oi (first build the initial feature vector of the opinion information, then find the samples similar to it and obtain the final feature vector, as in step (6); then apply the model built in step (7)). The final detection result is O = ∑ ei*Oi. If O is greater than a predetermined threshold, the opinion information is identified as a deceptive spam opinion.
Consumer opinion information identified as deceptive spam opinion by the above steps is uniformly labeled as deceptive spam opinion according to the standard.
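A minimal sketch of the three-stage identification and of the weighted combination O = ∑ ei*Oi follows. The function names, the `violates_rules` callback standing in for the rules of step (2), and the uniform-weight fallback when all distances are zero are assumptions made for illustration.

```python
def partition_weights(distances):
    """ei = di / sum(di). Assumption: fall back to uniform weights if all di are 0."""
    total = sum(distances)
    if total == 0:
        return [1.0 / len(distances)] * len(distances)
    return [d / total for d in distances]

def ensemble_score(distances, outputs):
    """O = sum(ei * Oi) over the per-partition detection results Oi."""
    weights = partition_weights(distances)
    return sum(e * o for e, o in zip(weights, outputs))

def is_deceptive(author, opinion, blacklist, violates_rules,
                 distances, outputs, threshold):
    """Blacklist check, then reverse (rule-based) check, then model ensemble."""
    if author in blacklist:
        return True
    if violates_rules(opinion):
        return True
    return ensemble_score(distances, outputs) > threshold

# Toy call: two partitions at distances 1.0 and 3.0 with model outputs +1 and -1.
score = ensemble_score([1.0, 3.0], [1.0, -1.0])
flagged = is_deceptive("user42", "great product", {"spammer1"},
                       lambda text: False, [1.0, 3.0], [1.0, -1.0], 0.0)
```

In practice the per-partition outputs Oi would come from the models of step (7); here they are passed in directly so that the combination logic can be read in isolation.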
(9) Updating the deceptive spam opinion detection model
Because deceptive and non-deceptive opinions change dynamically, the detection model and the opinion information sets are updated periodically while deceptive spam opinion detection is running (including updates of the relevant parameters). The process is as follows:
The deceptive spam opinions identified and labeled in step (8) are reviewed and confirmed in the same way as in step (2). The confirmed information is labeled and forms a new accurate deceptive spam opinion set (users who frequently publish deceptive spam opinions are added to the blacklist for later identification; at the same time, the behavior patterns of the opinion authors in the new accurate deceptive spam opinion set are summarized into rules for future use). A new set of unlabeled consumer opinion information is formed as well.
For the new consumer opinion information set, the vectors closest to the characterization vectors of the original partitions (to distinguish them, Marki is written Markoldi here) are taken as initial centers, and a non-hierarchical clustering method is run to obtain new consumer opinion information partitions. The characterization vector Marknewi of each new partition is computed, and then the sum of the distances between the new and old partition characterization vectors, Dis = ∑ Distance(Marknewi, Markoldi). If Dis is greater than a preset threshold, the model update process is executed (same as steps (3)-(7)), completing the update of the model.
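The update trigger can be sketched as below. The text does not say which distance function Distance is, so Euclidean distance is assumed here, and the function names are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def total_drift(new_marks, old_marks):
    """Dis = sum over partitions i of Distance(Mark_new_i, Mark_old_i)."""
    return sum(euclidean(n, o) for n, o in zip(new_marks, old_marks))

def needs_model_update(new_marks, old_marks, threshold):
    """True when the partitions have drifted by more than the preset threshold."""
    return total_drift(new_marks, old_marks) > threshold

# Toy data: the first partition moved by 5.0, the second did not move.
old_marks = [[0.0, 0.0], [1.0, 1.0]]
new_marks = [[3.0, 4.0], [1.0, 1.0]]
drift = total_drift(new_marks, old_marks)
```

The pairing of new and old vectors by index relies on the new clusters having been seeded from the old partition centers, as the text describes.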
After the above filtering steps, the information (non-spam) that takes part in subsequent processing is of relatively higher quality, which lays the foundation for accurate subsequent processing.
3. Region information discovery
To carry out risk early warning of food safety events using information on the Internet, event-related information must be obtained through certain processing. Within this, identifying the region that an event relates to from Internet information is very important work: it determines where the event occurs, which is the basis of food safety event early warning. This requires extracting and analyzing the content of Internet web page information to determine the region associated with food safety event information. The steps are as follows (as shown in Figure 3):
(1) Web page information preprocessing
For the web page information obtained by crawling and filtering, extract and save metadata such as the title, source, author, publication time, and location of the publishing website, and also extract and save the body content of the web page information.
The extracted title and body content are segmented with a word segmenter based on statistics and a dictionary (including the place-name dictionary formed from the ontology built in step 4 (1)). For each word, feature parameters are recorded, such as its start and end positions relative to the text formed by the title and body content, the sentence it belongs to, and its positions relative to the beginning and end of that sentence. Words that may not be place names are then excluded by matching against a vocabulary. The vocabulary is compiled in advance and regularly updated; it includes words that serve as both personal names and place names, and words that have other specific meanings but may also be place names. For example, Wuzhong is a city in Ningxia Hui Autonomous Region but can also be a personal name; Fangzheng is a county in Heilongjiang Province, but the same word is also the name of the Founder company. Note that words carrying a specific suffix, such as Wuzhong City, are not excluded.
(2) Place-name pronoun resolution
After word segmentation, the web page title and body text may contain pronouns that denote places, such as "this province", "this city", or "that province". Because such pronouns do not literally give an exact geographic location, they need to be resolved.
1) To resolve place-name pronouns, first establish a sliding window for pronoun resolution. The window length L is determined in advance (for example, after analyzing the distribution of the number of words between place-name pronouns and their antecedents).
2) Check whether a plausible geographic noun exists within the L words before the place-name pronoun (for example, Liaoning for "this province", judged by rules established in advance). If one exists, use the judgment model described below, which decides whether a reference relation exists between a geographic noun and a place-name pronoun. If a reference relation exists, the geographic noun corresponding to the pronoun is determined from it and resolution ends (if several geographic nouns satisfy the reference relation, the one nearest to the place-name pronoun is chosen); otherwise go to step 3).
3) If no plausible geographic noun exists within the L words, or the model judges that no reference relation exists, check whether a plausible geographic noun exists within the 2L words before the place-name pronoun (without crossing the sentence boundary, identified for example by a full stop). If one exists, use the judgment model described below to decide whether a reference relation exists. If a reference relation exists, the geographic noun corresponding to the pronoun is determined from it and resolution ends (if several geographic nouns satisfy the reference relation, the one nearest to the place-name pronoun is chosen); otherwise go to step 4).
4) If no plausible geographic noun exists within the 2L words, or the model judges that no reference relation exists, the place name that the pronoun refers to is determined by extraction or substitution from the information source or website location obtained during metadata extraction.
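The widening-window search of steps 2) to 4) can be sketched as follows. The callbacks `is_place` (the pre-established plausibility rules) and `refers` (the judgment model) are hypothetical stand-ins, and the sentence bound is applied to both passes for simplicity although the text states it only for the 2L pass.

```python
def resolve_place_pronoun(tokens, pronoun_idx, window, is_place, refers,
                          sentence_start=0, fallback=None):
    """Return the place name a pronoun refers to, or `fallback`.

    tokens         -- segmented words of the text
    pronoun_idx    -- index of the place-name pronoun in tokens
    window         -- the sliding window length L
    is_place       -- callable(token) -> bool, plausibility rules (assumed)
    refers         -- callable(tokens, noun_idx, pronoun_idx) -> bool (assumed)
    sentence_start -- index of the first word of the pronoun's sentence
    fallback       -- place taken from metadata (source or website location)
    """
    # Pass 1 searches L words back, pass 2 searches 2L words back.
    for span in (window, 2 * window):
        lowest = max(sentence_start, pronoun_idx - span)
        # Walk backwards so that the nearest qualifying noun wins.
        for i in range(pronoun_idx - 1, lowest - 1, -1):
            if is_place(tokens[i]) and refers(tokens, i, pronoun_idx):
                return tokens[i]
    return fallback

tokens = ["Liaoning", "reported", "that", "this province", "found"]
found = resolve_place_pronoun(tokens, 3, 2,
                              is_place=lambda t: t == "Liaoning",
                              refers=lambda toks, i, j: True)
```

With L = 2 the first pass finds nothing, and the second pass (2L = 4 words back) reaches "Liaoning".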
The judgment model is built as follows. Web page information containing place-name pronouns is collected into a sample set. For each place-name pronoun in the sample set, the reference relation between it and each geographic noun within the preceding 2L words (L as in step 1); without crossing the sentence boundary) is labeled and used as the class variable. For each such pair of place-name pronoun and geographic noun, relevant data are extracted to build a feature vector describing the relation between the two, including: the suffix length of the geographic noun (number of suffix characters divided by text length; the suffix denotes a place name or has place-name characteristics, such as "Autonomous Region" in "Xinjiang Uygur Autonomous Region"); the distance between the geographic noun and the place-name pronoun (number of words divided by text length); the relative distance of the geographic noun from the start of the text (number of words divided by text length); the relative distance of the place-name pronoun from the start of the text (number of words divided by text length); the relative distance of the geographic noun from the start of the sentence (number of words divided by text length); the relative distance of the place-name pronoun from the start of the sentence (number of words divided by text length); the relative distance of the geographic noun from the end of the sentence (number of words divided by text length); the relative distance of the place-name pronoun from the end of the sentence (number of words divided by text length); etc. A machine learning method (such as SVM) is then selected to build, from the sample set, class variable, and feature vectors, the judgment model that decides whether a reference relation exists between a geographic noun and a place-name pronoun.
The judgment model is applied to a place-name pronoun and a geographic noun as follows. First, the data describing the relation between the geographic noun and the place-name pronoun are extracted to form a feature vector. The extracted data are: the suffix length of the geographic noun (number of suffix characters divided by text length); the distance between the geographic noun and the place-name pronoun (number of words divided by text length); the relative distances of the geographic noun and of the place-name pronoun from the start of the text (number of words divided by text length); their relative distances from the start of the sentence (number of words divided by text length); their relative distances from the end of the sentence (number of words divided by text length); etc. The judgment model built above is then applied, and the result determines whether a reference relation exists between the place-name pronoun and the geographic noun.
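A sketch of the eight-component feature vector follows. It assumes positions are word indices and that a single `text_len` is used as the normalizer for every component; the function and parameter names are illustrative only.

```python
def coreference_features(text_len, suffix_len, noun_idx, pronoun_idx,
                         sent_start, sent_end):
    """Build the feature vector for one (geographic noun, pronoun) pair.

    All values are divided by text_len, as in the description.
    sent_start / sent_end are the indices of the first and last word
    of the sentence containing both items (an assumption of this sketch).
    """
    n = float(text_len)
    return [
        suffix_len / n,                     # suffix length of the geographic noun
        abs(pronoun_idx - noun_idx) / n,    # distance between noun and pronoun
        noun_idx / n,                       # noun distance from start of text
        pronoun_idx / n,                    # pronoun distance from start of text
        (noun_idx - sent_start) / n,        # noun distance from start of sentence
        (pronoun_idx - sent_start) / n,     # pronoun distance from start of sentence
        (sent_end - noun_idx) / n,          # noun distance from end of sentence
        (sent_end - pronoun_idx) / n,       # pronoun distance from end of sentence
    ]

features = coreference_features(text_len=100, suffix_len=3, noun_idx=10,
                                pronoun_idx=14, sent_start=8, sent_end=20)
```

A vector like this, paired with the labeled class variable, is what a classifier such as an SVM would be trained on.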
(3) Non-standard word resolution
After word segmentation, the web page title and body text may contain words that denote places in non-standard linguistic forms, for example "beijing" or "bj" appearing in Chinese text. Such non-standard place-name forms are resolved by looking them up in a table that maps non-standard words to standard words (built in advance and regularly updated) and replacing them with the standard form.
(4) Relative position resolution
After word segmentation, the web page title and body text may contain words that denote places by relative position, such as "the southwestern provinces of China". Such expressions likewise contain no explicit place name. To solve this problem, the relative-position region information is queried and resolved against the region information ontology instances and their attached tables built in step 4 (1), yielding exact place-name words. For example, for "the southwestern provinces of China", using the region information ontology that has been built, first find the provinces belonging to China, then query the location-orientation attached table of each of these provinces, extract all provinces whose location orientation is southwest, and substitute them for "the southwestern provinces of China", completing the resolution.
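The lookup in the example can be sketched with two small dictionaries standing in for the region ontology and its location-orientation attached table. The data structures and the three sample provinces are illustrative assumptions, not the ontology format used by the method.

```python
# Hypothetical, heavily abbreviated ontology: parent region -> child regions.
REGION_ONTOLOGY = {"China": ["Sichuan", "Yunnan", "Liaoning"]}

# Hypothetical location-orientation attached table (relative to the parent).
ORIENTATION = {"Sichuan": "southwest", "Yunnan": "southwest",
               "Liaoning": "northeast"}

def resolve_relative_position(parent, orientation):
    """Return the child regions of `parent` whose orientation matches."""
    children = REGION_ONTOLOGY.get(parent, [])
    return [r for r in children if ORIENTATION.get(r) == orientation]

# "the southwestern provinces of China" -> explicit province names
provinces = resolve_relative_position("China", "southwest")
```

The returned names then replace the relative expression in the text before region determination.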
(5) Region determination
After the web page information has been preprocessed and resolved as above, the region associated with the information is determined. This mainly involves two steps: judging the relevant region by pattern matching, and judging it with a machine learning model.
The goal of region determination is to identify the region that the information relates to, providing the regional basis for discovering food safety event information. Considering accuracy, computational cost, and operability, pattern matching is applied first. Two questions must be considered here: the scope of information and the matching rules. The matching rules are based on the region information ontology built in step 4 (1) (that is, the region dimension of the ontology), mainly using the names, attributes, etc. of the relevant ontology instances; specifically, the names, attributes, etc. of these instances are combined and pattern matching is used to make the judgment. The concrete pattern matching methods include Boolean matching, frequency matching, and matching on the distance between instance names; the choice of method and the specific rules are determined after statistical analysis of the information (determined in advance and regularly updated). As for the scope of information, mainly two dimensions are considered, the title and the content of the information. Since the title and the content may not match, the title is processed first: if, after pattern matching on the title, the information can be assigned to the currently selected region (such as Beijing), pattern matching for this region is finished; otherwise a second round of pattern matching for this region is performed on the content of the information. Throughout, the principle of "better to miss than to over-include" is followed, to ensure the accuracy of the identification results as far as possible.
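The title-first cascade can be sketched as below. Using Boolean matching on the title and frequency matching on the content, and the `min_count` value, are illustrative choices; the text leaves the actual rules to prior statistical analysis.

```python
def boolean_match(text, names):
    """True if any instance name or attribute value occurs in the text."""
    return any(name in text for name in names)

def frequency_match(text, names, min_count):
    """True if the instance names occur at least min_count times in total."""
    return sum(text.count(name) for name in names) >= min_count

def matches_region(title, content, names, min_count=2):
    """Check the title first; fall back to the content only if the title fails."""
    if boolean_match(title, names):
        return True
    return frequency_match(content, names, min_count)

beijing_names = ["Beijing"]  # in practice: instance name plus attached-table values
hit = matches_region("Beijing recalls dairy batch", "", beijing_names)
```

Requiring more than one occurrence in the content is one simple way to follow the "better to miss than to over-include" principle.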
If the information cannot be assigned to any region by the above pattern matching, a third judgment is made with a region judgment model built by a machine learning method. The region judgment model is built in advance as follows. The basis is a web page information sample set that has been processed (as in steps (1)-(4)) and labeled (whether or not associated with a given region); the sample set is built in advance and regularly updated. The title and content words of each sample (selecting the words that match ontology instance names and attributes) are combined and sorted into five categories: administrative place names (provinces, cities, etc.), area codes, postcodes, abbreviations, and landmarks (mountains, lakes, seas, rivers, islands, buildings, etc.), forming five feature vectors (the term weights in the vectors are term frequencies; given the importance of title words, the weights of title words are multiplied by a predetermined factor). Then, for each target region, region judgment models are built from these five feature vectors with a machine learning method (support vector machine, etc.), giving five models that are regularly updated from the updated sample set. The third judgment proceeds as follows. For information that has been processed and resolved in steps (1)-(4) but could not be assigned to a region, the title and content words (selecting the words that match ontology instance names and attributes) are combined and sorted into the same five categories to form five vectors (with the same term weights, including the predetermined factor for title words). Each of the five vectors is judged with the corresponding one of the five region judgment models built above, and the results are combined by weighting (the weight of each category is the sum of the term frequencies in that category divided by the sum of the term frequencies over the five categories). If the weighted result is greater than a preset threshold, the information is assigned to this region; otherwise it is not.
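The weighted combination of the five per-category judgments can be sketched as follows. The category keys, the dictionary inputs, and the treatment of an item with no matching words are assumptions of this sketch; the per-category scores would come from the five trained models.

```python
CATEGORIES = ("admin_name", "area_code", "postcode", "abbreviation", "landmark")

def combine_region_scores(freq_by_category, score_by_category, threshold):
    """Weight each category's model output by its share of the term frequency.

    freq_by_category  -- summed term frequency of matching words per category
    score_by_category -- output of that category's region judgment model
    Returns True if the weighted result exceeds the threshold.
    """
    total = sum(freq_by_category.get(c, 0) for c in CATEGORIES)
    if total == 0:
        return False  # no matching words at all (assumption: treat as no match)
    combined = sum(
        freq_by_category.get(c, 0) / total * score_by_category.get(c, 0.0)
        for c in CATEGORIES
    )
    return combined > threshold

freqs = {"admin_name": 3, "landmark": 1}
scores = {"admin_name": 1.0, "landmark": 0.0}
assigned = combine_region_scores(freqs, scores, threshold=0.5)
```

Here the administrative-name model carries three quarters of the weight, so its positive judgment alone is enough to pass a threshold of 0.5.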
4. Regional event early warning
Combining intelligent system methods, the steps designed for discovering and giving early warning of regional food safety event information are shown in Figure 4 and described below.
(1) Building the ontology
Considering the characteristics of food safety events and the needs of analyses such as event information extraction and tracking, the food safety event information ontology is built mainly along five dimensions: object, region, time, result, and associated party. For example, the object (i.e. food) can be divided into categories such as primary products and processed products, and primary products can in turn be divided into categories such as veterinary antibiotics, and so on. The result can be divided into categories such as contamination and poisoning, and contamination can in turn be divided into categories such as expired and exceeding the standard, and so on. Regions can be divided overall into five categories: Asia, Europe, Africa, America, and Oceania. Each category can be subdivided; for example, Asia can be divided into six categories: East Asia, West Asia, South Asia, North Asia, Central Asia, and Southeast Asia, and so on, until a category can no longer be divided, at which point it is a bottom-level element (i.e. an instance). The other categories are built in a similar way. In addition, for each instance in the ontology, attached tables of synonyms, antonyms, aliases, etc. are established. For the instances in the region information ontology, attached tables are also established along six dimensions: area code, postcode, abbreviation, landmarks (mountains, lakes, seas, rivers, islands, buildings), adjacent regions (neighboring regions of the same level to the east, south, west, north, etc.), and location orientation (relative to the level above, such as central or south), for use during information processing.
(2) Information filtering
Since a website may contain content unrelated to the predetermined topic, the information is filtered before subsequent processing in order to improve the accuracy of event information discovery and early warning. This is food safety information filtering.
Food safety information filtering judges whether the collected information is related to food safety. Two questions must be considered here: the scope of information and the filtering rules. The filtering rules are based on the food safety event information ontology that has been built, mainly considering the object and result dimensions; specifically, the names, attributes, etc. of the ontology instances of these two dimensions are combined and pattern matching is used for filtering. The concrete pattern matching methods include Boolean matching, frequency matching, matching on the distance between instance names, synonym and antonym matching of instance names, and alias matching of instance names; the choice of method and the specific rules are determined after statistical analysis of the information (determined in advance and regularly updated). As for the scope of information, mainly two dimensions are considered, the title and the content of the information. Since the title and the content may not match, the title is processed first: if, after filtering on the title, the information can be classed as food safety information, processing of this information is complete; otherwise a second judgment is made on the content of the information.
After the above filtering steps, the information that takes part in subsequent processing (i.e. food-safety-related non-spam information) is of relatively higher quality, which lays the foundation for accurate subsequent processing.
(3) Object information discovery
Object information discovery for web page information is object category identification: determining which kind of object the content described by the web page information relates to (and which event factors are involved, what consequences resulted, etc.). Its purpose is to determine the event as uniquely as possible by combining the region information, object information, etc. found in the web page information.
To this end, considering identification accuracy, computational cost, and operability, regression analysis is used. As for the scope of information, the title and content of each web page are combined and, after word segmentation, stop-word removal, and dimensionality reduction, form the feature vector of the web page (the independent variables). The term weights are term frequencies; given the importance of title words, the weights of title words are multiplied by a predetermined factor. Likewise, the weights of words that match the names or attributes of object, result, and associated-party instances in the ontology are multiplied by a predetermined factor. For each object category, the feature vector of the web page is substituted into the corresponding logistic regression model (built in advance from the sample set for the categories that need to be distinguished), and the regression result is used to judge whether the web page information is related to this object category.
The regression analysis model is built as follows. Based on a web page information sample set that has been cleaned and labeled (built in advance and regularly updated), the title and content words of each sample are combined and, after word segmentation, stop-word removal, and dimensionality reduction, form a feature vector (the independent variables). The term weights are term frequencies; given the importance of title words, the weights of title words are multiplied by a predetermined factor. Likewise, the weights of words that match the names or attributes of object, result, and associated-party instances in the ontology are multiplied by a predetermined factor. The object category of each web page is also labeled (1 means it belongs to this object category, 0 means it does not; this is the dependent variable). On this basis, a regression analysis model is built for each object category with the logistic method.
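A minimal sketch of the weighted term-frequency features and of scoring them with an already fitted logistic model is given below. The multipliers `title_mult` and `onto_mult`, the choice to count each title occurrence `title_mult` times, and the coefficient dictionary are all illustrative assumptions; fitting the model itself is not shown.

```python
import math
from collections import Counter

def weighted_term_frequencies(title_tokens, body_tokens, ontology_terms,
                              title_mult=2.0, onto_mult=1.5):
    """Term-frequency weights with boosts for title words and ontology matches."""
    weights = Counter()
    for tok in body_tokens:
        weights[tok] += 1.0
    for tok in title_tokens:
        weights[tok] += title_mult      # each title occurrence counts title_mult times
    for tok in list(weights):
        if tok in ontology_terms:
            weights[tok] *= onto_mult   # boost words matching ontology instances
    return dict(weights)

def logistic_probability(features, coefficients, intercept):
    """P(page belongs to the object category) under a fitted logistic model."""
    z = intercept + sum(coefficients.get(term, 0.0) * w
                        for term, w in features.items())
    return 1.0 / (1.0 + math.exp(-z))

features = weighted_term_frequencies(["milk"], ["milk", "tainted"], {"milk"})
prob = logistic_probability(features, {"milk": 0.4, "tainted": 0.2}, -1.0)
```

One such model per object category would be evaluated, and a page is linked to every category whose probability passes the chosen cut-off.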
(4) Trend tracking and event early warning
In practice, combining the region information and object category information found in the preceding steps makes it possible to determine fairly accurately the event that is occurring (the event-related information is represented by the intersection of the information belonging to these two dimensions).
On the basis of identifying the region and object category of the web page information, characteristic parameters representing the event are established. Specifically, the characteristics of an event are represented by the number of event-related information pages, the number of page views, the number of page forwards, the number of page views on specific websites, the number of page views of websites under specific domain names, and a composite index (obtained as a weighted summary of the above parameters; the weights are determined by the Delphi method and must sum to 1). The characteristic parameters are calculated periodically (e.g. every hour), and their changes over time are analyzed comprehensively.
On the basis of this event trend tracking, the value of each characteristic parameter of the event (including the composite index) is calculated periodically (e.g. every 12 hours), and the current value of each characteristic parameter is compared with its mean over a preceding period (at present, considering how network events propagate, one month is chosen as the calculation period; this can be adjusted as needed). If the difference is positive and its absolute value is greater than a certain threshold (e.g. 3 standard deviations; the threshold is set in advance), early-warning initialization is performed for this event.
An event for which early-warning initialization has been performed is then tracked. The value of each characteristic parameter of the event (including the composite index) is calculated periodically (e.g. every 12 hours), and the current value of each characteristic parameter is compared with its mean over a preceding period (at present, considering how network events propagate, the month before early-warning initialization is chosen as the calculation period; this can be adjusted as needed). If the difference stays above a certain threshold (e.g. 3 standard deviations; the threshold is set in advance) for a sustained time (e.g. 24 hours, determined in advance), a formal early warning is issued for this event. Otherwise, the early-warning initialization for this event is cancelled.
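The initialization and confirmation logic can be sketched for a single characteristic parameter as follows. The state names, the use of the population standard deviation of the baseline, and expressing the sustained time as a number of consecutive readings (`sustain`) are assumptions of this sketch.

```python
import statistics

def exceeds_baseline(current, history, k=3.0):
    """True if current is more than k standard deviations above the baseline mean."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return (current - mean) > k * std

def warning_state(readings, history, k=3.0, sustain=2):
    """Classify an event from its periodic readings of one parameter.

    readings[0] is the reading that may trigger early-warning initialization;
    the next `sustain` readings must all stay above the threshold for a
    formal warning (e.g. 2 readings at 12-hour intervals for 24 hours).
    """
    if not readings or not exceeds_baseline(readings[0], history, k):
        return "none"
    follow_up = readings[1:1 + sustain]
    if any(not exceeds_baseline(r, history, k) for r in follow_up):
        return "cancelled"
    return "formal" if len(follow_up) == sustain else "initialized"

# Baseline: made-up values of one parameter over the preceding period.
history = [10, 12, 11, 9, 10, 11, 10, 12]
state = warning_state([20, 21, 22], history)
```

In the described method the threshold itself comes from the regression-based procedure rather than a fixed multiple; the factor k here only mirrors the "3 standard deviations" example.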
The threshold is determined as follows. Historical change data of each characteristic parameter of events are collected (e.g. over one year) and combined with confirmed data on the time of occurrence, region, scale, etc. of historical food safety events (obtainable from the relevant food safety administrative departments). The difference between each characteristic parameter value of an event and its mean over the preceding period (e.g. one month) is calculated and forms the independent variables; a variable indicating whether a food safety event of the particular kind occurred (1 means it occurred, 0 means it did not) is the dependent variable. A regression prediction model between these independent and dependent variables is built by logistic regression analysis. Based on this model, and taking into account the historical trend characteristics of the event characteristic parameters, a suitable independent-variable value that makes the dependent variable equal to 1 is selected as the threshold.
(6) Event termination judgment
For an event under formal early warning, on the basis of the above event trend tracking, the value of each characteristic parameter of the event (including the composite index) is calculated periodically (e.g. every 12 hours), and the current value of each characteristic parameter is compared with its mean over a preceding period (at present, considering how network events propagate, the period from the day the early warning started to the day before the calculation is chosen as the calculation period; this can be adjusted as needed). If the difference is negative and its absolute value is greater than a certain threshold (e.g. 3 standard deviations; the threshold is set in advance), the event is considered to have ended, and the early warning for this event is terminated.
(7) Ontology supplementation and correction
Throughout event information discovery and early warning, the food safety event information ontology that has been built has an important influence on the performance of steps such as information filtering and information discovery. Therefore, considering the changing distribution of Internet information, and in order to keep improving the efficiency of the method, the results of processes such as information filtering and information discovery need to be evaluated periodically. Deficiencies, omissions, errors, etc. in the ontology are supplemented and corrected accordingly, to improve the subsequent efficiency of the method.
5. Prediction and early warning of event risk in the target area
When a particular event has occurred in some regions, the possibility that the event will occur in the target area (where it has not yet occurred), and its possible time of occurrence, are calculated periodically, and early warnings of different levels are issued according to the results of the analysis (as shown in Fig. 5). Before this calculation can be made, the models (which are updated regularly) are established as follows:
Regions of the same administrative level as the target area are selected (for example, for the target area Beijing, provincial-level regions such as Hebei and Henan). For these regions (including the target area; let their total number be R), confirmed data on historical food safety events, such as time of occurrence, region, and scale, are collected (these can be obtained from the food safety regulatory authorities), forming a data set that records where and when each food safety event occurred. On this basis, a network is built according to whether a particular event occurred in a region. The vertices of the graph are the regions and the food safety events described above. If a particular event occurred in a region, an edge is created between the vertex identifying the region and the vertex identifying the event, and the weight of the edge is the number of times this has occurred. The network is then converted into an R*S matrix A (R is the number of regions and S is the number of food safety events), which is formed in advance and updated regularly.
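As a rough sketch of how the region-event network becomes matrix A, the following example counts occurrences per (region, event) pair. The function name `build_region_event_matrix` and the record layout (a list of `(region, event)` tuples) are assumptions made for illustration.

```python
from collections import Counter

def build_region_event_matrix(records):
    """Build the R*S count matrix from (region, event) occurrence records.

    Each record is one confirmed occurrence; the edge weight between a region
    and an event is the number of such records.
    """
    regions = sorted({region for region, _ in records})
    events = sorted({event for _, event in records})
    counts = Counter(records)
    matrix = [[counts[(region, event)] for event in events] for region in regions]
    return regions, events, matrix

# Hypothetical toy data: region names come from the text, event names are made up.
records = [("Hebei", "E1"), ("Hebei", "E1"), ("Henan", "E2"), ("Beijing", "E1")]
regions, events, A = build_region_event_matrix(records)
```

Row and column order is fixed by sorting here so that the matrix can be updated in place when new occurrences are confirmed.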
Meanwhile, N time ranges are set according to the difference between the time at which a particular event occurred in the target area and the time at which it occurred in the region where it occurred earliest. (For example, 5 time periods can be set: the event occurs in the target area within 1 day, within 3 days, within 1 week, within 2 weeks, or within 1 month of its earliest occurrence.) The original data set is labeled for each time range (indicating whether the particular event occurred in each region within that time range), forming N data sets (5 data sets when 5 time periods are set), which are formed in advance and updated regularly. On this basis, for each data set, whether the particular event occurred in the target area within the corresponding time range is taken as the dependent variable (1 means it occurred, 0 means it did not), and whether the corresponding event occurred in each of the remaining regions is taken as an independent variable (1 means it occurred, 0 means it did not). Logistic regression analysis is used to build regression prediction models between the independent variables and the dependent variable (5 models, denoted C1, C2, C3, C4, C5, formed in advance and updated regularly).
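One of the per-window models can be sketched with a small logistic regression fitted by gradient descent. This is only an illustration under assumptions: the function names, learning rate, epoch count, and toy data are hypothetical, and the text does not say how the regression is fitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Probability that the event occurs in the target area within the window."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Toy data for one time window: each row says whether the event occurred
# (1) or not (0) in two other regions; y says whether it then occurred in
# the target area within the window.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
weights = fit_logistic(X, y)
```

Fitting one such model per labeled data set would give the N models; a probability above 0.5 could serve as the "will occur" judgment, although the cut-off is also an assumption.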
On this basis, the possibility that the particular event will occur in the target area, and its possible time of occurrence, are calculated as follows:
The corresponding elements of matrix A are updated according to the regions where the particular event has currently occurred. Matrix A is then processed by matrix decomposition to form a new matrix B. (For example, using the SVD method, matrix A is first subjected to singular value decomposition, $A = T_y S_y D_y$, where $T_y$ is an R*F matrix, $S_y$ is an F*F diagonal matrix, $D_y$ is an F*S matrix, and F is the rank of matrix A. A positive integer K with 0 < K < F is set, and only the K largest singular values in $S_y$ are considered: the corresponding K-order diagonal matrix of $S_y$ is taken as $S_m$, the corresponding K columns of $T_y$ as $T_m$, and the corresponding K rows of $D_y$ as $D_m$. The inverse operation of the singular value decomposition is then carried out, $B = T_m S_m D_m$, which completes the processing.) The element of matrix B that identifies the association between the target area and the particular event is then found. If it is greater than a preset threshold, it can be determined that the particular event may occur in the target area; otherwise, it can be determined that the particular event is unlikely to occur in the target area.
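The truncation step can be illustrated with a small sketch. It is an illustration under assumptions, not the patented implementation: the helper names are hypothetical, and power iteration with deflation stands in for a library SVD routine so that the example needs only the standard library. A numerical library's SVD would normally be used instead.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_singular_triplet(M, iters=200, seed=0):
    """Approximate the largest singular value of M and its vectors by power iteration."""
    Mt = [list(col) for col in zip(*M)]
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in Mt]
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # M is numerically zero, nothing left to extract
            return 0.0, [0.0] * len(M), [0.0] * len(Mt)
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv] if sigma else [0.0] * len(M)
    return sigma, u, v

def rank_k_approximation(A, k):
    """Return B, the sum of the k largest singular components of A."""
    residual = [[float(x) for x in row] for row in A]
    B = [[0.0] * len(A[0]) for _ in A]
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(len(A)):
            for j in range(len(A[0])):
                term = sigma * u[i] * v[j]
                B[i][j] += term         # accumulate the component into B
                residual[i][j] -= term  # deflate it from the residual
    return B

# Toy matrix: rows are regions, columns are events; the last row is the
# target area, where the second event has not occurred yet.
A = [[1, 1], [1, 1], [1, 0]]
B = rank_k_approximation(A, k=1)
```

In this toy case the entry of B for the target area and the second event becomes positive even though the corresponding entry of A is 0, which is the value that would be compared with the preset threshold.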
If the above judgment determines that the particular event may occur in the target area, the value of each independent variable is formed from the regions where the particular event has currently occurred (1 means it occurred, 0 means it did not) and substituted into the regression prediction models above. The models are evaluated in the order C5, C4, C3, C2, C1. Specifically, if the result from C5 is true (the event will occur), the judgment for C4 is carried out; if the result is false (the event will not occur within that range, that is, it may occur after 1 month), the judgment stops. This continues until a result is false or all models have been evaluated, which gives the time at which the event may occur in the target area: the time range represented by the last regression prediction model whose result was true. For example, if C2 is the last model whose result is true, the particular event can be predicted to occur in the target area after 1 day and within 3 days. In this way, early warnings of different time levels can be issued for the risk that the particular event occurs in the target area.
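The ordered judgment from C5 down to C1 amounts to a short cascade. In the sketch below, the function name `predict_time_window` and the stand-in predicates are hypothetical; in practice each predicate would wrap one fitted regression prediction model.

```python
def predict_time_window(models, features):
    """Evaluate models from the widest window (C5) to the narrowest (C1).

    models: list of (window_label, predicate) pairs in evaluation order.
    Returns the label of the last model that judged "will occur", or None
    when even the widest window is judged false.
    """
    last_true = None
    for label, predicate in models:
        if not predicate(features):
            break  # stop at the first false judgment
        last_true = label
    return last_true

# Stand-in predicates: each "model" fires when enough regions are affected.
models = [
    ("within 1 month", lambda x: sum(x) >= 1),  # C5
    ("within 2 weeks", lambda x: sum(x) >= 1),  # C4
    ("within 1 week", lambda x: sum(x) >= 2),   # C3
    ("within 3 days", lambda x: sum(x) >= 2),   # C2
    ("within 1 day", lambda x: sum(x) >= 3),    # C1
]
window = predict_time_window(models, [1, 1, 0])
```

A result of `None` corresponds to the case in which C5 is false, so the event is not expected within 1 month.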
6. Result display and service
On the basis of predicting and analyzing whether and when the particular event will occur in the target area, the results of the analysis are presented to the user as tables, charts, and the like, and instant delivery services such as short message and e-mail are provided.
Thus the whole process is fully realized: food safety event information is extracted from the crawled internet information, early warning is carried out according to the evolution of the event and the event risk of the target area, and the user is served in a timely manner. In this process, techniques such as junk information filtering, regional information discovery, object category information discovery, trend tracking and early warning, and risk prediction and early warning ensure the accuracy of event information discovery and early warning and of risk prediction and early warning. This provides an important information basis for risk early warning and rapid emergency handling of food safety events.
It is worth noting that the present invention is not limited to the emergency management of food safety events. With slight modification, it can be applied to emergency handling work, such as risk early warning, for other unconventional emergencies whose event information can be obtained from the internet.
Claims (10)
1. A method for predicting and giving early warning of event occurrence risk based on open internet information, comprising the steps of:
1) establishing a food safety event information ontology, and establishing an additional table for each instance in the ontology;
2) performing junk filtering on the crawled web page information to obtain non-junk web page information;
3) parsing the words that denote places in the filtered web page information to obtain precise place names; processing the parsed web page information by pattern matching, based on the names and attributes of the ontology instances in the region dimension of the food safety event information ontology, and assigning the web page information to the successfully matched region;
4) filtering the web page information to obtain web page information related to food safety; then, for each set object category, processing the filtered web page information with a regression analysis model to determine the object category to which each piece of web page information relates;
5) obtaining, according to the region to which the web page information belongs and its related object category as determined in steps 3) and 4), the set of web page information for events concerning a set region and object, establishing characteristic parameters of the event and periodically calculating their values, and issuing an early warning for an event if a characteristic parameter value of the event exceeds a set threshold continuously for a set time;
6) if an early warning for a set object event occurs in a certain region, periodically calculating, based on matrix analysis and regression prediction models, the possibility that the set object event will occur in the target area and its possible time of occurrence, and issuing risk early warnings of different levels;
wherein the method of parsing the words that denote places in the web page information is:
A) for a place-name pronoun, using a judgment model to judge whether a coreference relation exists between the place-name pronoun and a place name appearing before it, and if so, replacing the place-name pronoun with the corresponding place name;
B) parsing non-standard place names in the words based on a lookup table of standard and non-standard words, and replacing non-standard words with standard words;
C) parsing the relative-position region information in the words based on the region dimension of the food safety event information ontology to obtain precise place names;
wherein the judgment model is established as follows: forming a sample set from web page information that contains place-name pronouns, and labeling the coreference relation between each place-name pronoun in the sample set and the place names before it, as the class variable; establishing a feature vector of the relation between a place-name pronoun and a place name before it; then selecting a machine learning method to build, based on the sample set, the class variable, and the feature vector, a judgment model of whether a coreference relation exists between a place name and a place-name pronoun;
wherein the method of judging whether a coreference relation exists between a place-name pronoun and a place name appearing before it is: calculating the value of the feature vector of the relation between the place-name pronoun and the place name, and applying the judgment model to the feature vector value to determine whether a coreference relation exists between the place-name pronoun and the place name.
2. The method of claim 1, wherein deceptive spam opinions in the crawled web page information are filtered as follows:
21) crawling web pages from selected user-generated content sources, and building a user opinion information set from the crawled pages; clustering the user opinion information set to obtain several information regions, and calculating the mean of the feature vectors of all information in each information region as the concept vector of that region;
22) sampling the user opinion information in each information region to obtain a sample set for each region;
23) labeling the samples in the sample set of each information region to obtain, for each region, a deceptive spam opinion sample set and an unlabeled opinion information sample set;
24) for each sample, finding the P samples most similar to it in the sample set of each information region, and obtaining the final feature vector of the sample from the class labels of the P samples and their similarity values to the sample;
25) selecting a machine learning method to build, based on the final feature vector of each sample, a deceptive spam opinion detection model for each information region;
26) filtering the information in the user opinion information set with the deceptive spam opinion detection models.
3. The method of claim 2, wherein the sample set of each information region is obtained as follows: first labeling the information in the user opinion information set that is determined to be deceptive spam opinion, and building a set of confirmed deceptive spam opinion information; then, after the opinion information has been partitioned into regions, drawing samples from each region several times by random sampling, and selecting, with reference to the confirmed deceptive spam opinion information set, the draw that contains the most deceptive spam opinions as the final sample of that region, thereby obtaining the sample set of each information region.
4. The method of claim 2 or 3, wherein, for each sample, an initial feature vector is formed from characteristic parameters in the content and link dimensions of the sample, and is used to find the P samples most similar to the sample in the sample set of each information region.
5. The method of claim 2, wherein, in filtering the information in the user opinion information set with the deceptive spam opinion detection models, weight coefficients are established based on the distance between a piece of opinion information and each information region, the detection results of the individual deceptive spam opinion detection models for the user opinion information are combined by weighting to obtain a final detection result, and the user opinion information is labeled according to the final detection result.
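The weighted combination in claim 5 can be sketched as follows. The claim states only that the weights depend on the distance between the opinion and each information region; the inverse-distance form, the function name `combine_detections`, and the small constant `eps` are assumptions made for this illustration.

```python
def combine_detections(scores, distances, eps=1e-9):
    """Combine per-region detector scores, weighting nearer regions more.

    scores: one detection score per information region's model.
    distances: distance from the opinion to each region's concept vector.
    eps avoids division by zero when an opinion sits exactly on a centre.
    """
    weights = [1.0 / (d + eps) for d in distances]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

# Two regional models disagree; the opinion is three times closer to the
# first region, so its model dominates the combined result.
final_score = combine_detections([1.0, 0.0], [1.0, 3.0])
```

The combined score would then be compared with a cut-off to label the opinion as deceptive spam or not.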
6. The method of claim 2, wherein the final feature vector of a sample is calculated as follows:
A) performing word segmentation on the content of the extracted sample opinion information, removing stop words, and forming a content feature vector $Q_j$ after dimensionality reduction, where j is the sample number;
B) calculating the link features of the sample opinion information, and weighting the link features to obtain a total value, denoted $L_j$;
C) calculating $M_j = L_j \cdot Q_j$ to obtain the initial feature vector $M_j$, which characterizes the sample opinion information based on content and links;
D) for each sample Sample in an information region, calculating, based on the initial feature vector of the sample, its similarity to each sample in each information region, and sorting the samples in descending order of similarity to obtain its similar-sample sequence;
E) multiplying the class label of each of the first P samples in the sample sequence by the corresponding similarity value to form a vector N of dimension P, as the final feature vector of the sample Sample.
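Steps C) to E) of claim 6 can be sketched as follows. Several choices here are assumptions: cosine similarity as the similarity measure, the label encoding (+1 for deceptive spam, -1 otherwise), and the function names.

```python
import math

def initial_vector(link_score, content_vector):
    """Step C: scale the content vector Q_j by the link score L_j."""
    return [link_score * q for q in content_vector]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def final_feature_vector(sample, labelled_samples, p):
    """Steps D and E: rank samples by similarity, keep the top p,
    and multiply each class label by its similarity value.

    labelled_samples: list of (initial_vector, class_label) pairs.
    """
    ranked = sorted(
        ((cosine(sample, vec), label) for vec, label in labelled_samples),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [similarity * label for similarity, label in ranked[:p]]

neighbours = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], -1)]
N = final_feature_vector([1.0, 0.0], neighbours, p=2)
```

The resulting vector always has length P, so samples from different information regions end up in the same feature space for the detection model.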
7. The method of claim 2, wherein the confirmed deceptive spam opinion information set and the unlabeled user opinion information set are supplemented and updated periodically; the updated user opinion information set is then clustered, the distance between the current concept vector and the previous concept vector of each information region is calculated, and the distances are summed to obtain an accumulated value Dis; when Dis exceeds a preset threshold, the deceptive spam opinion detection model of each information region is updated.
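The update trigger in claim 7 can be sketched as a sum of distances between old and new concept vectors. The Euclidean distance, the pairing of regions by list position, and the function names are assumptions; the claim does not state how regions from successive clusterings are matched.

```python
import math

def concept_drift(previous_vectors, current_vectors):
    """Sum the distances between each region's previous and current concept vector (Dis)."""
    return sum(math.dist(p, c) for p, c in zip(previous_vectors, current_vectors))

def models_need_update(previous_vectors, current_vectors, threshold):
    """Return True when the accumulated drift Dis exceeds the preset threshold."""
    return concept_drift(previous_vectors, current_vectors) > threshold

previous = [[0.0, 0.0], [1.0, 1.0]]
current = [[3.0, 4.0], [1.0, 1.0]]
dis = concept_drift(previous, current)
```

Retraining only when Dis crosses the threshold avoids rebuilding every regional model after each periodic data refresh.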
8. The method of claim 2, wherein the feature vector used for cluster analysis of the user opinion information set is obtained as follows: extracting from the opinion information the character count, word count, number of paragraphs, mean paragraph length, number of sentences, mean sentence length, number of first-person pronouns, number of second-person pronouns, number of third-person pronouns, number of adjectives, number of adverbs, number of verbs, number of person names, number of place names, number of organization names, number of time expressions, number of exclamation marks, number of question marks, and number of characters in the title, and normalizing them to obtain the feature vector used for cluster analysis of the user opinion information set.
9. The method of claim 1, wherein the place-name pronouns that denote places in the web page information are parsed as follows:
91) establishing a sliding window of length L for pronoun parsing;
92) checking whether a place name exists within the L words before the place-name pronoun; if so, applying the judgment model, and if a coreference relation exists, determining the place name corresponding to the pronoun from the coreference relation and ending the parsing; otherwise proceeding to step 93);
93) checking whether a place name exists within the 2L words before the place-name pronoun; if so, applying the judgment model, and if a coreference relation exists, determining the place name corresponding to the pronoun from the coreference relation and ending the parsing; otherwise proceeding to step 94);
94) determining the place name to which the place-name pronoun refers by extraction or substitution, according to the information source or website location obtained during metadata extraction.
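The window search in claim 9 can be sketched as follows. The function name, the `is_coreferent` callable (a stand-in for the trained judgment model), and the nearest-first scan order are assumptions made for illustration.

```python
def resolve_place_pronoun(tokens, pronoun_index, place_names,
                          is_coreferent, window, fallback_place):
    """Resolve a place-name pronoun to a place name.

    Searches the `window` tokens before the pronoun, then 2 * window tokens,
    and finally falls back to the place taken from the page metadata.
    is_coreferent(tokens, candidate_index, pronoun_index) stands in for the
    judgment model and returns True when a coreference relation exists.
    """
    for span in (window, 2 * window):
        start = max(0, pronoun_index - span)
        # Scan candidates from the nearest token to the farthest one.
        for i in range(pronoun_index - 1, start - 1, -1):
            if tokens[i] in place_names and is_coreferent(tokens, i, pronoun_index):
                return tokens[i]
    return fallback_place

tokens = ["Hebei", "recalled", "milk", "and", "officials", "there", "responded"]
place = resolve_place_pronoun(
    tokens, 5, {"Hebei"}, lambda t, i, j: True, window=3, fallback_place="Beijing"
)
```

Here the first pass over 3 tokens finds no place name, the second pass over 6 tokens reaches "Hebei", and the metadata fallback is used only when both passes fail.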
10. The method of claim 1, 2, or 9, wherein the possibility that the set event will occur in the target area and its possible time of occurrence are calculated, and risk early warnings of different levels are issued, as follows:
11) selecting the historical event information set of regions of the same administrative level as the target area, and building an event network from that set; wherein the vertices of the event network identify the regions and the food safety events, and if an event occurred in a region, an edge is created between the vertex identifying the region and the vertex identifying the event, the weight of the edge being the number of times the event occurred;
12) converting the event network into an R*S matrix A, where R is the number of regions and S is the number of food safety events;
13) setting N time ranges, based on the historical event information set, according to the difference between the time at which the set event occurred in the target area and the time at which the event first occurred, and labeling the historical event information set for each time range to form N data sets;
14) for each data set, taking whether the set event occurred in the target area within the corresponding time range as the dependent variable and whether the corresponding event occurred in each remaining region as an independent variable, and building a regression prediction model between the independent variables and the dependent variable by regression analysis;
15) updating the corresponding elements of matrix A, and processing matrix A by a matrix decomposition method to form a new matrix B;
16) finding the element of matrix B that identifies the association between the target area and the set event; if it is greater than a preset threshold, determining that the set event may occur in the target area; otherwise, determining that the set event will not occur;
17) if it is determined that the set event will occur in the target area in the future, obtaining the values of the independent variables from the regions where the set event has currently occurred, substituting them into the regression prediction models for judgment, and obtaining from the judgment results the predicted time at which the set event may occur in the target area;
18) issuing, according to the risk prediction results, early warnings of different levels for the risk that the set event occurs in the target area.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210501872.7A CN103854063B (en) | 2012-11-29 | 2012-11-29 | A kind of prediction of event occurrence risk method for early warning based on internet opening imformation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103854063A CN103854063A (en) | 2014-06-11 |
CN103854063B true CN103854063B (en) | 2017-04-05 |
Family
ID=50861693
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210501872.7A Active CN103854063B (en) | 2012-11-29 | 2012-11-29 | A kind of prediction of event occurrence risk method for early warning based on internet opening imformation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103854063B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||