CN110209813A - Emergency event detection and prediction method based on an autoencoder - Google Patents

Emergency event detection and prediction method based on an autoencoder

Info

Publication number
CN110209813A
Authority
CN
China
Prior art keywords
topic
text
theme
value
time window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910401627.0A
Other languages
Chinese (zh)
Inventor
于健
王帅杰
徐天一
高洁
赵满坤
喻梅
于瑞国
原旭莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910401627.0A
Publication of CN110209813A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an emergency event detection and prediction method based on an autoencoder. The method comprises the following steps: performing Chinese word segmentation and stop-word removal on the data text; representing the processed data text as a text vector and reducing the dimensionality of the text vector; computing the similarity between the current text and each topic, sorting the similarities between the text and all topics in ascending order, and comparing the maximum similarity with a threshold to determine the topic to which the current text belongs or to create a new topic; computing the topic heat value, and classifying as emergency events those events for which the difference between the time window in which a news text of a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold. The invention effectively addresses the high dimensionality and sparsity of the document vectors obtained when news documents are converted into vectors, and adapts better to the way the number of news texts varies over time in practice.

Description

Emergency event detection and prediction method based on an autoencoder
Technical field
The present invention relates to the field of text detection and prediction, and in particular to an emergency event detection and prediction method based on an autoencoder.
Background
In the related art, the main text representation technique for emergency events is currently the autoencoder (AutoEncoder). In deep learning, an autoencoder is used in the training stage to transform the features of the input data, that is, to encode the data into another form on which further learning is then carried out. In essence, an autoencoder uses the nodes of its hidden layer to reconstruct the neurons of the input layer, so that the output of the network is as close as possible to its input; during training, the loss function is optimized by backpropagation to obtain a smaller loss value. Because of the competition between neurons in the hidden layer, each neuron becomes specialized in recognizing a specific data pattern, so the autoencoder as a whole can learn meaningful text representations. In image representation, autoencoders have been widely applied with relatively good results. Text, however, is complex: it is high-dimensional, sparse, and has a power-law word distribution. A traditional autoencoder therefore tends to learn only trivial representations of text, and its performance on text data sets has not yet been studied extensively.
Emergency events are mainly detected with Single-Pass clustering. The main idea of the Single-Pass algorithm is to process the input texts one at a time and determine how well the current text matches the existing clusters. If the current text matches an existing cluster, it is assigned to that cluster; otherwise a new cluster is created. In the Single-Pass algorithm, however, the order in which texts are fed to the model directly affects the final clustering. Moreover, with the traditional single threshold, a threshold that is too high makes the clusters too fine-grained and lowers recall, whereas a threshold that is too low makes the clusters too coarse and lowers precision.
Emergency events are mainly predicted from the growth rate of the topic heat value. The growth rate of the heat value across time windows is used to predict hot events and emergency events. A topic heat threshold FThre, a first-derivative threshold DThre1 and a second-derivative threshold DThre2 are specified. When the topic heat value exceeds FThre, the first derivative of the topic heat is computed; if the first derivative is greater than DThre1 and the second derivative is not less than DThre2, the topic is considered a hot event, otherwise it is not. This method has obvious shortcomings. Because the heat value is computed per time window, it is not continuous; and in practice, owing to the suddenness and complexity of emergency events on the network, the heat value fluctuates from one time window to the next. The idealized growth-rate method, by contrast, assumes that the first-order growth rate of the topic heat value has only one peak. As a result, the method predicts hot events and emergency events poorly in practical applications.
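The prior-art rule described above can be sketched as follows. This is a minimal illustration, not the referenced method's actual implementation: the function name `is_hot_event` is hypothetical, and approximating the derivatives by backward finite differences over consecutive time windows is an assumption, since the text does not say how they are computed.

```python
def is_hot_event(heat, t, f_thre, d_thre1, d_thre2):
    """Judge whether a topic is a hot event at time window t.

    heat    -- list of topic heat values, one per time window
    f_thre  -- topic heat threshold (FThre)
    d_thre1 -- first-derivative threshold (DThre1)
    d_thre2 -- second-derivative threshold (DThre2)

    Assumption: derivatives are backward finite differences.
    """
    # Two earlier windows are needed for the second difference.
    if t < 2 or heat[t] <= f_thre:
        return False
    first = heat[t] - heat[t - 1]
    second = heat[t] - 2 * heat[t - 1] + heat[t - 2]
    return first > d_thre1 and second >= d_thre2
```

Because the heat values are discrete and fluctuate between windows, the differences can change sign several times, which is the weakness the paragraph above points out.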
Summary of the invention
The present invention provides an emergency event detection and prediction method based on an autoencoder. The invention effectively addresses the high dimensionality and sparsity of the document vectors obtained when news documents are converted into vectors, and adapts better to the way the number of news texts varies over time in practice, as described below:
An emergency event detection and prediction method based on an autoencoder, the method comprising the following steps:
performing Chinese word segmentation and stop-word removal on the data text; representing the processed data text as a text vector, and reducing the dimensionality of the text vector;
computing the similarity between the current text and each topic, sorting the similarities between the text and all topics in ascending order, and comparing the maximum similarity with a threshold to determine the topic to which the current text belongs or to create a new topic;
computing the topic heat value, and classifying as emergency events those events for which the difference between the time window in which a news text of a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold.
Wherein computing the topic heat value specifically comprises: computing the topic heat value using RD-based energy decay.
Further, the method also comprises:
predicting emergency events based on the growth rate, wherein, when judging whether a topic is an emergency event, the time window in which the topic becomes a hot event is compared with the time window in which a news text belonging to the topic first appears.
Wherein reducing the dimensionality of the text vector specifically comprises:
adding a threshold R to the hidden layer; when the absolute value of the energy of a loser node in the hidden layer is greater than the given threshold R, the node is regarded as contributing effectively to the learning of the autoencoder network;
when the activation value of a loser neuron is less than the given threshold R, the activation value of the loser neuron is added to the winner neurons, and the value of the loser neuron is set to zero.
Further, predicting emergency events based on the growth rate specifically comprises:
defining the growth-rate curve of the topic heat value:
$$D_t = G[F(y_t)],\quad F(y_t) \in [F(y_A), F(y_B)]$$
where A is the time point of maximum growth rate while the topic heat curve is in its growth stage, B is the time point at which the topic heat curve changes from growing to levelling off, and $F(y_A)$ and $F(y_B)$ are the topic heat values at points A and B respectively;
meanwhile, to deal with the fluctuation of the heat-value growth rate caused by the time windows, the growth rate is smoothed using the following formula;
where $D_t$ is the actual growth rate of the topic in time window t, and $\delta_i$ is the smoothing coefficient applied to the heat-value growth rate at the corresponding time window t.
Wherein classifying as emergency events those events for which the difference between the time window in which a news text of a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold specifically comprises:
setting two thresholds T1 and T2. If there is a topic whose similarity to the text is greater than T1, the current text is considered to belong to that topic and the topic center vector is updated. If the similarity is less than T2, no topic is considered to correspond to the current text, and a new topic is created. If the similarity is less than T1 but greater than T2, the current text may belong to the current topic: the similarities between the current text and all news texts in that topic are computed and sorted, and if the maximum is greater than T1, the current text is considered to belong to the topic, is added to it, and the topic center is updated.
The beneficial effects of the technical solution provided by the present invention are as follows:
The invention uses the model loss value during training to evaluate how well KAER reduces the dimensionality of text vectors, and compares it with KATE. The results are shown in Fig. 2: KAER performs better than KATE. After training, the loss of KAER drops from an initial 0.599 to 0.141, while the loss of KATE drops from an initial 0.624 to 0.157. In terms of reconstructing the input vector, KAER reconstructs the model's input data better than KATE.
For emergency event detection, the invention clusters the dimension-reduced vectors with the improved Single-Pass method and evaluates the result using precision, recall and F value. The experiments show that the precision and recall of clustering on the dimension-reduced text vectors are clearly higher than before dimensionality reduction, which indicates that the improved autoencoder learns text features well while reducing the dimensionality of the text vectors.
The invention computes the topic heat value with the improved RD-based energy decay scheme and compares the RD-based scheme with the improved one; the curves of the heat value of the same topic over time windows are shown in Fig. 3 and Fig. 4 respectively. In the improved RD-based energy decay scheme, the nutrition decay factor changes dynamically with the number of news texts in the current time window. The scheme therefore takes full account of the marked differences in the number of news texts between time windows, and the resulting topic heat curve is smoother and closer to how the topic heat value changes in practice.
The invention uses an improved growth-rate-based method to predict hot events and emergency events. When judging whether a topic is an emergency event, the time window in which the topic becomes a hot event is compared with the time window in which a news text belonging to the topic first appears. The prediction results show that the improved growth-rate-based method predicts hot events and emergency events better.
Brief description of the drawings
Fig. 1 is a flow chart of an emergency event detection and prediction method based on an autoencoder;
Fig. 2 is a schematic comparison of the loss values of KATE (K-competitive autoencoder, K-Competitive Autoencoder) and KAER (improved K-competitive autoencoder model, K-Competitive Autoencoder with R-Threshold);
Fig. 3 is a schematic diagram of the topic heat curve under the RD-based (recursive decay) energy decay scheme;
Fig. 4 is a schematic diagram of the topic heat curve under the improved RD-based energy decay scheme.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the invention are described in further detail below.
Embodiment 1
To solve the problems described in the background, this embodiment of the invention proposes an emergency event detection and prediction method based on an autoencoder. Referring to Fig. 1, the method comprises the following steps:
101: performing Chinese word segmentation and stop-word removal on the data text;
102: representing the processed data text as a text vector, and reducing the dimensionality of the text vector;
103: computing the similarity between the current text and each topic, sorting the similarities between the text and all topics in ascending order, and comparing the maximum similarity with a threshold to determine the topic to which the current text belongs or to create a new topic;
104: computing the topic heat value, and classifying as emergency events those events for which the difference between the time window in which a news text of a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold;
105: measuring the accuracy of the proposed emergency event detection and prediction method by means of the loss function, recall, precision and F value.
In one embodiment, step 101 performs Chinese word segmentation and stop-word removal on the data text, as follows:
Given the characteristics of Chinese word segmentation, this embodiment uses the jieba segmenter to segment the text, tags each segmented word with its part of speech, and after tagging filters the stop words out of the data using a stop-word dictionary.
In one embodiment, step 102 builds the text vector representation and reduces its dimensionality on the basis of step 101, as follows:
For a news text that has been segmented, the text content is read line by line, and each word is checked against the dictionary. If the dictionary contains the word, its occurrence count in the current text is incremented by one; if not, the word is considered unimportant and is filtered out directly. This is repeated until every word in the text has been read. Each news text is finally represented as a vector made up of the index of each word in the dictionary and the number of times the word occurs in the current text. Dimensionality reduction uses the improved K-competitive autoencoder model (K-Competitive Autoencoder with R-Threshold, KAER).
In one embodiment, step 103 computes topic similarities for the preprocessed news text, takes the maximum and compares it with a threshold, as follows:
A dictionary is first created from the corpus; the words in the text are then matched against the dictionary, the similarities of the matched words are summed, and the sum is taken as the degree of correlation between two texts. The cosine similarity between a topic and the news text is computed as the similarity, the similarities are sorted in ascending order, and the maximum is taken.
The maximum similarity is then compared with the specified threshold. If it is less than the threshold, the current text is considered not to belong to any existing topic; a new topic is created and the current text is assigned to it. Otherwise, the current text is assigned to the matching topic and the topic's center vector is updated.
In one embodiment, step 104 computes the topic heat value and predicts emergency events, as follows:
The topic heat value is computed with the improved RD-based energy decay scheme (Recursive Decay with Dynamic Timeslot), and emergency events are predicted with the improved growth-rate-based prediction method. When judging whether a topic is an emergency event, the time window in which the topic becomes a hot event is compared with the time window in which a news text belonging to the topic first appears.
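The comparison of the two time windows can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `is_emergency_event` and the parameter `max_gap` (the specified threshold on the window difference) are hypothetical, time windows are assumed to be integer indices, and a topic that never becomes hot is assumed not to be an emergency event.

```python
def is_emergency_event(first_seen_window, hot_window, max_gap):
    """Classify a topic as an emergency event.

    first_seen_window -- index of the time window in which a news text
                         of the topic first appears
    hot_window        -- index of the time window in which the topic is
                         predicted to become hot, or None if it never does
    max_gap           -- specified threshold on the window difference
    """
    if hot_window is None:
        # Assumption: a topic that never becomes hot is not an emergency.
        return False
    return (hot_window - first_seen_window) < max_gap
```

For example, with `max_gap=3`, a topic first seen in window 10 and predicted hot in window 11 is classified as an emergency event, while one that takes until window 20 is treated as an ordinary hot topic.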
In one embodiment, step 105 measures the accuracy of the method using the loss function, recall, precision and F value, as follows:
The loss function is used to compute how much information is lost through compression. The detection and prediction results are tallied against the actual results to compute recall, precision and F value, which measure the accuracy of the invention.
In summary, this embodiment of the invention effectively addresses the high dimensionality and sparsity of the document vectors obtained when news documents are converted into vectors, and adapts better to the way the number of news texts varies over time in practice.
Embodiment 2
The scheme of Embodiment 1 is described further below with reference to specific formulas:
201: in the analysis of emergency events, the text is first segmented and its stop words removed, using the jieba segmenter;
As to the segmentation mode, jieba is run in precise mode, which is suited to segmenting text in large volumes of data; in this mode each sentence is cut into words as accurately as possible.
202: the segmented words are tagged with their part of speech, and after tagging the stop words are filtered out of the data using a stop-word dictionary;
The parts of speech assigned to segmented words include, for example, noun, verb and adjective.
203: the vector representation of a news text is built with a dictionary-based method;
After the dictionary is built, its words are sorted and each word is numbered. After the number of occurrences of each word in the current news text has been counted, the text vector is normalized with a logarithmic function; the text vector is expressed as shown in formula (1).
Wherein $x_i$ is the vector form of the news text, V is the dictionary, and $n_i$ is the number of times word i occurs in the current news text.
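The dictionary-based representation can be sketched as follows. Formula (1) is not reproduced in this text, so the exact normalization is an assumption: the sketch uses $\log(1 + n_i)$ divided by the largest such value in the text, the form used by the original KATE model. The function names `build_dictionary` and `text_vector` are hypothetical, and the vector is stored sparsely as a mapping from dictionary index to weight.

```python
import math
from collections import Counter


def build_dictionary(corpus_tokens):
    """Sort the corpus vocabulary and number each word."""
    words = sorted({w for doc in corpus_tokens for w in doc})
    return {w: i for i, w in enumerate(words)}


def text_vector(tokens, dictionary):
    """Sparse log-normalized vector: {dictionary index: weight}.

    Words missing from the dictionary are filtered out, as in the
    description. Assumed normalization: log(1 + n_i) / max_j log(1 + n_j).
    """
    counts = Counter(w for w in tokens if w in dictionary)
    if not counts:
        return {}
    logs = {dictionary[w]: math.log(1 + n) for w, n in counts.items()}
    peak = max(logs.values())
    return {idx: value / peak for idx, value in logs.items()}
```

With this normalization the most frequent word in a text always has weight 1, and rarer words fall between 0 and 1, which damps the effect of very frequent words.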
204: the dimensionality of the text is reduced with the improved K-competitive autoencoder model (K-Competitive Autoencoder with R-Threshold, KAER);
Text vectors are high-dimensional and sparse, so this embodiment reduces their dimensionality with the improved KAER. A threshold R is added to the hidden layer. When the absolute value of the energy of a loser node in the hidden layer is greater than the given threshold R, the node is regarded as contributing effectively to the learning of the autoencoder network and is left unprocessed. When the activation value of a loser neuron is less than the given threshold R, its activation value is added to the winner neurons and the value of the loser neuron is set to zero. In this embodiment the threshold R is 0.1.
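The R-threshold competition step can be sketched as follows, for a single hidden-layer activation vector. This is an illustration under several assumptions that the text does not confirm: as in the original KATE model, positive and negative activations compete separately (the strongest ⌈k/2⌉ positive and ⌊k/2⌋ negative neurons win), the reallocated energy is scaled by an amplification factor `alpha`, and a loser whose absolute activation exactly equals R is kept. The function name `kaer_competition` is hypothetical.

```python
def kaer_competition(z, k, r_threshold=0.1, alpha=1.0):
    """Apply k-competition with an R threshold to hidden activations z.

    Winners keep their activation. Losers with |activation| >= R are left
    untouched; losers below R give their activation to the winners of the
    same sign and are set to zero.
    """
    # Rank positive neurons from largest, negative from most negative.
    pos = sorted((i for i, v in enumerate(z) if v > 0),
                 key=lambda i: z[i], reverse=True)
    neg = sorted((i for i, v in enumerate(z) if v < 0),
                 key=lambda i: z[i])
    k_pos = (k + 1) // 2  # ceil(k / 2) positive winners
    k_neg = k // 2        # floor(k / 2) negative winners

    out = list(z)
    for ranked, n_win in ((pos, k_pos), (neg, k_neg)):
        winners, losers = ranked[:n_win], ranked[n_win:]
        weak = [i for i in losers if abs(z[i]) < r_threshold]
        energy = sum(z[i] for i in weak)
        for i in winners:
            out[i] += alpha * energy  # reallocate weak losers' energy
        for i in weak:
            out[i] = 0.0
    return out
```

With `z = [0.9, 0.05, 0.3, -0.8, -0.02, -0.4]`, `k = 2` and R = 0.1, the losers 0.3 and -0.4 survive unchanged because they exceed R in absolute value, while 0.05 and -0.02 are zeroed and their energy is added to the winners 0.9 and -0.8. In plain KATE all four losers would have been zeroed.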
205: the similarity between a news text and a topic is computed with cosine similarity, as shown in formula (2).
$$\cos(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert} \quad (2)$$
Wherein $D_1$ and $D_2$ are the dictionary-based vector representations of the two news texts.
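Cosine similarity over sparse dictionary-based vectors can be sketched as follows. The sparse `{index: weight}` representation and the function name `cosine_similarity` are assumptions made for illustration; returning 0 for an all-zero vector is also a convention chosen here, not something the text specifies.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine similarity of two sparse vectors given as {index: weight}."""
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention: an empty vector matches nothing
    return dot / (norm1 * norm2)
```

Only indices present in both vectors contribute to the dot product, so the cost depends on the number of non-zero entries rather than on the dictionary size.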
206: topics are detected in the news texts with the improved Single-Pass algorithm; the clusters are then ranked by the number of news texts under each topic, and the clustering result is evaluated against manual labels.
In the improved Single-Pass clustering method, two thresholds T1 and T2 are set. If there is a topic whose similarity to the text is greater than T1, the current text is considered to belong to that topic and the topic center vector is updated. If the similarity is less than T2, no topic is considered to correspond to the current text, and a new topic is created. If the similarity is less than T1 but greater than T2, the current text may belong to the current topic: the similarities between the current text and all news texts in that topic are computed and sorted, and if the maximum is greater than T1, the current text is considered to belong to the topic, is added to it, and the topic center is updated.
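The two-threshold procedure can be sketched as follows. Several details are assumptions, because the description leaves them open: the topic center is taken to be the mean of its member vectors, only the most similar topic is examined in the band between T2 and T1, and a text that fails the member-level check starts a new topic. The names `single_pass`, `cosine` and `center_of` are hypothetical, and vectors are sparse `{index: weight}` mappings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {index: weight}."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def center_of(members):
    """Assumed topic center: the mean of the member vectors."""
    total = {}
    for vec in members:
        for i, w in vec.items():
            total[i] = total.get(i, 0.0) + w
    return {i: w / len(members) for i, w in total.items()}


def single_pass(texts, t1, t2):
    """Cluster texts in arrival order; return one topic label per text."""
    topics = []  # each topic: {"center": vector, "members": [vectors]}
    labels = []
    for vec in texts:
        best, best_sim = None, -1.0
        for idx, topic in enumerate(topics):
            sim = cosine(vec, topic["center"])
            if sim > best_sim:
                best, best_sim = idx, sim
        assign = None
        if best is not None and best_sim > t1:
            assign = best
        elif best is not None and best_sim > t2:
            # Uncertain band: compare with every text already in the topic.
            if max(cosine(vec, m) for m in topics[best]["members"]) > t1:
                assign = best
        if assign is None:
            topics.append({"center": dict(vec), "members": [vec]})
            assign = len(topics) - 1
        else:
            topics[assign]["members"].append(vec)
            topics[assign]["center"] = center_of(topics[assign]["members"])
        labels.append(assign)
    return labels
```

The second check lets a text join a topic whose center has drifted away from it, as long as it is still very close to at least one member, which softens the sensitivity to a single threshold.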
207: when computing the heat value, the energy value is used to represent the development state of a topic, and the change in the energy value is used to predict the topic's likely life cycle;
For any topic V in the t-th time window, let $x_t$ denote the sum of the similarities between the topic and all texts that belong to it within the specified time. At time t, the energy value of the topic is given by formula (3).
$$y_t = g(x_1, \ldots, x_t, \alpha, \beta) \quad (3)$$
Wherein $x_i$ is the sum of the similarities between a given topic and the news texts belonging to it in the i-th time window, and α and β are the two parameters of the life-cycle model, with 0≤α≤1 and 0≤β≤1. α is the nutrition conversion factor and determines the proportion of $x_i$ that can be contributed to the nutritive value of the topic; β is the nutrition decay factor and determines how the energy of the topic in each time window gradually decreases over time.
In real news publishing, the number of news items is affected by time. On the one hand, clearly more news is published on weekdays than at weekends; on the other hand, clearly more news is published during special events than at ordinary times, for example during the World Cup or when a disaster strikes somewhere. The energy value of a topic is directly related to the number of news texts belonging to it, yet the RD-based energy decay scheme does not take into account the effect of time on topic heat in practice. The invention therefore proposes a dynamic nutrition decay factor that takes full account of the effect of the time interval on topic heat, as shown in formula (4):
$$\beta_i = \beta \cdot \log(1.0 + n_i / \mathrm{avg}) \quad (4)$$
Wherein $n_i$ is the number of news texts in the i-th time window, avg is the average number of news texts published by the news website in one time window, $\beta_i$ is the nutrition decay factor in the i-th time slice, and β is the nutrition decay factor computed in the RD-based energy decay scheme.
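Formula (4) can be sketched as follows. The function name `dynamic_decay_factors` is hypothetical. Two points are assumptions: the logarithm is taken to be the natural logarithm, since the base is not stated, and avg is estimated as the mean of the per-window counts passed in.

```python
import math


def dynamic_decay_factors(beta, window_counts):
    """Per-window nutrition decay factors following formula (4).

    beta          -- base decay factor from the RD-based scheme
    window_counts -- number of news texts in each time window (n_i)

    Assumptions: natural logarithm; avg is the mean of window_counts.
    """
    if not window_counts:
        return []
    avg = sum(window_counts) / len(window_counts)
    if avg == 0:
        return [0.0 for _ in window_counts]
    return [beta * math.log(1.0 + n / avg) for n in window_counts]
```

A window with an average number of texts gets a factor of β·log 2, a busier window gets a larger factor and a quiet window a smaller one, so the decay tracks the publishing volume instead of staying fixed.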
Wherein an energy function $F(y_t)$ is defined to compute the heat of a topic. Its argument is the energy value of the topic, and it must satisfy the following conditions, as shown in formula (5):
Wherein $y_t$ is the energy value of the topic at time t, and $F(y_t)$ normalizes the topic energy. The energy function is computed as shown in formula (6).
$$F(r \cdot y_t) = s,\quad r > 0,\ s < 1 \quad (6)$$
Wherein $y_t$ is the energy value of the topic at time t. The formula can be read as follows: when the texts belonging to a topic account for a proportion r, the energy function returns a topic heat value of s.
208: using the improved growth-rate-based prediction method (Improved method of Forecasting based on rate), the growth-rate curve of the topic heat value is defined as shown in formula (7).
$$D_t = G[F(y_t)],\quad F(y_t) \in [F(y_A), F(y_B)] \quad (7)$$
Wherein A is the time point of maximum growth rate while the topic heat curve is in its growth stage, B is the time point at which the topic heat curve changes from growing to levelling off, and $F(y_A)$ and $F(y_B)$ are the topic heat values at points A and B respectively.
Meanwhile, to deal with the fluctuation of the heat-value growth rate caused by the time windows, the growth rate is smoothed using formula (8).
Wherein $D_t$ is the actual growth rate of the topic in time window t, δ is a set of empirical values, [32, 24, 16, 8, 4], and $\delta_i$ is the smoothing coefficient applied to the heat-value growth rate at the corresponding time window t.
209: the loss function is used to compute how much information is lost through compression; the detection and prediction results are tallied against the actual results to obtain recall, precision and F value, which measure the accuracy of the method.
In summary, this embodiment of the invention designs an emergency event detection and prediction method based on an autoencoder. The neurons in the hidden layer of the autoencoder model represent the text vector after dimensionality reduction, and the improved energy decay scheme (formula 4) adapts better to the way the number of news texts varies over time in practice. Existing news on the network can thus be used to predict hot events and emergency events in advance, providing sufficient preparation time to better maintain social stability.
Embodiment 3
The feasibility of the schemes in Embodiments 1 and 2 is verified below with reference to Fig. 2 to Fig. 4:
To assess whether the word representations captured by KAER are semantically meaningful, the experiment checks whether similar or related words lie close to each other in the vector space model [1]. For a word of a given topic, its index in the dictionary is obtained, the word's weights in the model are looked up by that index, and words similar or related to the given word are then found from the vectors with nearby weights. In terms of word similarity, KAER learns words of higher similarity than KATE.
To assess the difference between the text vector reconstructed by KAER and the input vector, the experiment compares how the model loss changes during training. Given an input vector $x_i$, the loss between it and the model's output vector $\hat{x}_i$ is computed as shown in formula (9).
Wherein V is the number of text vectors, and $\hat{x}_i$ is the reconstruction of the text vector $x_i$ by the improved autoencoder. In terms of reconstructing the input vector, KAER reconstructs the model's input data better than KATE.
Meanwhile, this embodiment uses recall (Recall), precision (Precision) and F value (F-Measure) as the evaluation criteria for detection. Precision is the proportion of correctly predicted samples among all predicted samples and mainly measures the accuracy of the algorithm, as shown in formula (10). Recall is the ratio of the number of relevant texts predicted to the number of relevant texts in the corpus, as shown in formula (11). The F value is the harmonic mean of precision and recall, as shown in formula (12).
$$P = \frac{m}{n} \quad (10)$$
Wherein m is the number of texts in a topic that the clustering algorithm detected correctly, and n is the number of texts that the clustering algorithm actually assigned to that topic.
$$R = \frac{m}{t} \quad (11)$$
Wherein m is the number of texts in a topic that the clustering algorithm detected correctly, and t is the number of texts in the specified text set that actually belong to that topic.
$$F = \frac{2PR}{P + R} \quad (12)$$
Wherein P is precision and R is recall.
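The three evaluation measures can be computed as in the following sketch. The function name `precision_recall_f` is hypothetical, and returning 0 when a denominator is zero is a convention chosen here rather than something the text specifies.

```python
def precision_recall_f(m, n, t):
    """Precision, recall and F value for one topic.

    m -- texts in the topic that the clustering detected correctly
    n -- texts the clustering assigned to the topic
    t -- texts in the text set that actually belong to the topic
    """
    precision = m / n if n else 0.0
    recall = m / t if t else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value
```

For instance, a topic with 8 correct texts out of 10 assigned and 16 truly relevant has precision 0.8 and recall 0.5; because the F value is a harmonic mean, it sits closer to the lower of the two.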
The experimental results showed that the accuracy rate and recall rate of the text vector after dimensionality reduction are apparently higher than text for cluster result It is before this dimensionality reduction as a result, illustrate improved autocoder by text vector carry out dimensionality reduction to improve the same of efficiency of algorithm When, text feature has been arrived in preferably study.For example, the topic of millet listing, since in a short time, listing problem causes extensively Concern, therefore the accuracy rate of the topic and recall rate are all higher;The topic of the Changjiang river No. 1 flood in 2018, due in the time In section, the rainfall in domestic multiple areas is all larger, and the words such as " heavy rain ", " flood ", " injures and deaths " is caused to be frequent, and reduces The specific gravity of regional keyword, therefore the accuracy rate of cluster result is lower, since analog result is all considered as same topic by model, because This recall rate is higher.
For the evaluation of the improved growth-rate-based prediction method, the experimental results and analysis together show that, when both the time at which an event appears and its subsequent development fall within the time range of the selected data set, whether the event is a hotspot event or an emergency event can be judged from its state of development. When an event had already developed into a hotspot event before the time range of the selected data set, the model's prediction of the emergency nature of such an event is not ideal. In summary, the improved growth-rate-based prediction method can predict hotspot events and emergency events fairly well.
In Fig. 2, KAER performs better than KATE. With KAER, the loss value drops from an initial 0.599 to 0.141 after training, whereas with KATE it drops from 0.624 to 0.157. Therefore, in terms of reconstructing the input vector, KAER reconstructs the input data of the model better than KATE.
In Fig. 3, under the RD-based energy decay scheme, the nutrition decay factor obtained by training is 0.0024. Because the nutrition decay factor is a fixed value, it does not take into account that the number of news texts differs markedly between time periods. The topic hot value curve of the topic therefore changes abruptly and contains many sawteeth, which is unfavourable for the subsequent growth-rate prediction method based on the topic heat curve.
In Fig. 4, under the improved RD-based energy decay scheme, the nutrition decay factor changes dynamically with the number of news texts in the current time window, so the marked differences in the number of news texts between time windows are fully taken into account. The topic heat curve of the topic is smoother and is therefore more consistent with how topic hot values change in practice.
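The difference between a fixed and a window-dependent decay factor can be sketched as follows. This is only a structural illustration: the additive update rule, the `gain` parameter, the function name `hot_value_curve` and the example decay functions are all assumptions, since the exact energy update formula is not given in this passage.

```python
def hot_value_curve(counts, decay_fn, gain=1.0):
    """Return a hot value per time window.

    counts:   number of news texts for the topic in each time window
    decay_fn: maps the text count of the current window to a decay amount
              (hypothetical; the actual decay rule is not specified here)
    gain:     hypothetical nutrition contributed by each news text
    """
    energy = 0.0
    curve = []
    for count in counts:
        # Add nutrition from new texts, subtract decay, never go below zero
        energy = max(0.0, energy + gain * count - decay_fn(count))
        curve.append(energy)
    return curve


counts = [3, 0, 5]                                      # illustrative window counts
fixed = hot_value_curve(counts, lambda c: 0.5)          # fixed decay factor
dynamic = hot_value_curve(counts, lambda c: 0.1 * c + 0.2)  # decay depends on count
```

Passing the decay as a function keeps the sketch neutral about how the factor should depend on the text count; only the fact that it varies per time window is taken from the description.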
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (6)

1. An emergency event detection and prediction method based on an autoencoder, characterized in that the method comprises the following steps:
performing Chinese word segmentation and stop-word removal on the data text; representing the processed data text as text vectors, and performing a dimensionality reduction operation on the text vectors;
calculating the similarity between the current text and each theme, sorting the similarities between the text and all themes in ascending order, and comparing the maximum similarity with a threshold to determine the theme to which the current text belongs or to create a new theme;
calculating the topic hot value, and classifying as an emergency event any event for which the difference between the time window in which a news text belonging to the topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold.
2. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that calculating the topic hot value specifically comprises: calculating the topic hot value using RD-based energy decay.
3. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that the method further comprises:
predicting emergency events based on the growth rate, wherein, when judging whether a theme is an emergency event, the time window in which the topic becomes a hotspot event is compared with the time window in which a news text belonging to the topic first appears.
4. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that performing the dimensionality reduction operation on the text vectors specifically comprises:
adding a threshold R to the hidden layer, wherein, when the absolute value of the energy value of a loser node in the hidden layer is greater than the given threshold R, the node is regarded as effective for the learning of the autoencoder network;
when the activation value of a loser neuron is less than the given threshold R, adding the activation value of the loser neuron to the winner neurons and setting the value of the loser neuron to zero.
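The hidden-layer rule of claim 4 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the claimed implementation: the choice of winners as the k largest activations by absolute value, the sign-matched redistribution, the amplification factor `alpha` and the function name are all hypothetical.

```python
def k_competitive_with_threshold(activations, k, r, alpha=1.0):
    """Apply a k-competitive step with threshold R to one hidden-layer vector.

    Assumptions (not stated in the claim): winners are the k neurons with the
    largest absolute activation; energy of suppressed losers is added to the
    winners of the same sign, scaled by alpha; |a| == R is treated as kept.
    """
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i]), reverse=True)
    winners, losers = order[:k], order[k:]
    out = list(activations)
    pos_energy = 0.0
    neg_energy = 0.0
    for i in losers:
        if abs(out[i]) >= r:
            continue                  # loser above R is still treated as effective
        if out[i] > 0:
            pos_energy += out[i]      # collect energy of weak positive losers
        else:
            neg_energy += out[i]      # collect energy of weak negative losers
        out[i] = 0.0                  # suppressed loser is set to zero
    for i in winners:
        if out[i] > 0:
            out[i] += alpha * pos_energy
        elif out[i] < 0:
            out[i] += alpha * neg_energy
    return out


hidden = k_competitive_with_threshold([0.9, -0.8, 0.3, 0.05, -0.02], k=2, r=0.1)
```

In this example the loser with activation 0.3 exceeds R and is kept unchanged, while the two weak losers are zeroed and their energy is moved onto the two winners.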
5. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that predicting emergency events based on the growth rate specifically comprises:
defining the growth rate curve of the theme hot value:
$$D_t = G[F(y_t)], \quad F(y_t) \in [F(y_A), F(y_B)]$$
Wherein A denotes the time point of maximum growth rate while the theme heat curve is in its growth phase, B denotes the time point at which the theme heat curve changes from growing to levelling off, and $F(y_A)$ and $F(y_B)$ denote the topic hot values of the topic at point A and point B respectively;
meanwhile, to address the fluctuation of the growth rate of the topic hot value caused by the time windows, the following formula is used to smooth the growth rate;
Wherein $D_t$ denotes the real growth rate of the theme in time window t, and $\delta_i$ denotes the smoothing factor corresponding to the growth rate of the topic hot value at the corresponding time window t.
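The idea of smoothing a per-window growth rate with smoothing factors can be sketched as follows. The smoothing formula itself is not reproduced in this text, so the sketch assumes a weighted moving average over the current and preceding time windows, and it assumes the growth rate is the difference between consecutive hot values; the function names and the weights are hypothetical.

```python
def growth_rates(hot_values):
    """Growth rate per window, assumed here to be the first difference."""
    return [b - a for a, b in zip(hot_values, hot_values[1:])]


def smooth(rates, weights):
    """Weighted moving average of growth rates (an assumed smoothing scheme).

    weights[i] plays the role of a smoothing factor applied to the growth
    rate i windows before the current one; weights must be positive.
    """
    smoothed = []
    for t in range(len(rates)):
        acc = 0.0
        wsum = 0.0
        for i, w in enumerate(weights):
            if t - i >= 0:            # only use windows that already exist
                acc += w * rates[t - i]
                wsum += w
        smoothed.append(acc / wsum)   # renormalise near the start of the series
    return smoothed


rates = growth_rates([1.0, 3.0, 6.0, 6.0])   # illustrative hot values
smoothed = smooth(rates, [0.5, 0.3, 0.2])
```

Renormalising by the weights actually used keeps the first few windows comparable to later ones, which is a design choice of this sketch rather than something stated in the claim.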
6. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that classifying as an emergency event any event for which the difference between the time window in which a news text belonging to the topic first appears and the time window in which the topic is predicted to become a hot topic is less than the specified threshold specifically comprises:
setting two thresholds T1 and T2; if there is a theme whose similarity to the text is greater than T1, the current text is considered to belong to this theme and the theme center vector is updated; if the similarity is less than T2, it is considered that no theme corresponds to the current text, and a new theme is created; if the similarity is less than T1 but greater than T2, the current text is considered likely to belong to the current theme, the similarities between the current text and all news texts in this theme are compared and sorted, and if their maximum value is greater than T1, the current text is considered to belong to this theme, the text is assigned to this theme, and the theme center is updated.
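The two-threshold assignment described in claim 6 can be sketched in Python as below. This is a minimal illustration under stated assumptions: cosine similarity, a theme center computed as the mean of member vectors, creating a new theme when the text-level check fails, and treating similarities exactly equal to T1 or T2 as the intermediate case are not specified in the claim. The names `Theme` and `assign` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class Theme:
    def __init__(self, vec):
        self.texts = [list(vec)]
        self.center = list(vec)

    def add(self, vec):
        self.texts.append(list(vec))
        n = len(self.texts)
        # Assumed update rule: center is the mean of all member vectors
        self.center = [sum(col) / n for col in zip(*self.texts)]


def assign(vec, themes, t1, t2):
    """Assign vec to a theme (or create one) and return the theme index."""
    if themes:
        sims = [cosine(vec, th.center) for th in themes]
        best = max(range(len(themes)), key=lambda i: sims[i])
        if sims[best] > t1:                       # clearly belongs to the theme
            themes[best].add(vec)
            return best
        if sims[best] >= t2:                      # uncertain: check member texts
            if max(cosine(vec, d) for d in themes[best].texts) > t1:
                themes[best].add(vec)
                return best
    themes.append(Theme(vec))                     # no matching theme: create one
    return len(themes) - 1
```

A possible call sequence is `themes = []` followed by `assign(vec, themes, t1=0.8, t2=0.5)` for each incoming text vector in time order; the threshold values here are illustrative only.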

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910401627.0A CN110209813A (en) 2019-05-14 2019-05-14 A kind of incident detection and prediction technique based on autocoder

Publications (1)

Publication Number Publication Date
CN110209813A true CN110209813A (en) 2019-09-06

Family

ID=67787220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910401627.0A Pending CN110209813A (en) 2019-05-14 2019-05-14 A kind of incident detection and prediction technique based on autocoder

Country Status (1)

Country Link
CN (1) CN110209813A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN105718598A (en) * 2016-03-07 2016-06-29 天津大学 AT based time model construction method and network emergency early warning method
WO2018086518A1 (en) * 2016-11-08 2018-05-17 北京国双科技有限公司 Method and device for real-time detection of new subject
CN108805167A (en) * 2018-05-04 2018-11-13 江南大学 L aplace function constraint-based sparse depth confidence network image classification method
CN108932311A (en) * 2018-06-20 2018-12-04 天津大学 The method of incident detection and prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
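Sun Hongguang et al.: "Online news topic discovery based on an improved Single-Pass algorithm", Journal of Jilin University (Science Edition) *
Lin Yuwang: "Research on emergency event detection and prediction methods", China Master's Theses Full-text Database, Information Science and Technology *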

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158079A (en) * 2021-04-22 2021-07-23 昆明理工大学 Case public opinion timeline generation method based on difference case elements
CN113987192A (en) * 2021-12-28 2022-01-28 中国电子科技网络信息安全有限公司 Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm
CN113987192B (en) * 2021-12-28 2022-04-01 中国电子科技网络信息安全有限公司 Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190906)