CN110209813A - An emergency event detection and prediction method based on an autoencoder - Google Patents
- Publication number: CN110209813A (CN201910401627.0A)
- Authority: CN (China)
- Prior art keywords: topic, text, theme, value, time window
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an emergency event detection and prediction method based on an autoencoder. The method comprises the following steps: performing Chinese word segmentation and stop-word removal on the data text; representing the processed data text as a text vector and reducing the dimensionality of the text vector; computing the similarity between the current text and each topic, sorting the similarities between the text and all topics in ascending order, and comparing the maximum similarity with a threshold to determine the topic to which the current text belongs or to create a new topic; computing the topic heat value, and classifying as an emergency event any event for which the difference between the time window in which a news text belonging to a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold. The invention effectively overcomes the problem that, after news documents are converted into vectors, the document vectors are high-dimensional and sparse, and it adapts better to the practical situation in which the number of news texts varies over time.
Description
Technical field
The invention relates to the field of text detection and prediction, and in particular to an emergency event detection and prediction method based on an autoencoder.
Background technique
In the related art, the main text representation technique for emergency events is the autoencoder (AutoEncoder). In deep learning, an autoencoder is used in the training stage to transform the features of the input data, that is, to encode the data into another form on which further learning is then carried out. In essence, an autoencoder uses the nodes of the hidden layer to reconstruct the neurons of the input layer, so that the output of the neural network is as close as possible to the network's input; during training, the loss function is repeatedly optimized by backpropagation to obtain a smaller loss value. Because the neurons in the hidden layer compete, each neuron specializes in recognizing particular data patterns, so the autoencoder as a whole can learn meaningful text representations. Autoencoders have been widely applied to image datasets with fairly good results. Text, however, is considerably more complex (high-dimensional, sparse, and with power-law word distributions), so a traditional autoencoder tends to learn only trivial representations of text, and the performance of autoencoders on text datasets has not yet been widely studied.
The main detection technique for emergency events is Single-Pass clustering. The main idea of the Single-Pass clustering algorithm is to process the input texts one at a time and determine how well the current text matches the existing clusters. If the current text matches an existing cluster, it is assigned to that cluster; otherwise a new cluster is created. In the Single-Pass algorithm, however, the order in which texts are fed to the model directly affects the final clustering. Moreover, with the traditional single threshold, a threshold set too high makes the clusters too fine-grained and lowers recall, while a threshold set too low makes the clusters too coarse and lowers precision.
The main prediction technique for emergency events is based on the growth rate of the topic heat value. The growth rate of the topic heat value across time windows is used to predict hot events and emergency events. A topic heat threshold FThre, a first-derivative threshold DThre1 and a second-derivative threshold DThre2 are specified. When the topic heat value exceeds FThre, the first derivative of the topic heat is computed first; if the first derivative is greater than DThre1 and the second derivative is not less than DThre2, the topic is considered a hot event; otherwise it is not. This method has obvious shortcomings. Because the heat value of a topic is computed per time window, it is not continuous. In practice, owing to the suddenness and complexity of emergency events on the network, the heat value of a topic fluctuates from one time window to the next, whereas the ideal growth-rate-based prediction method assumes by default that the first-order growth rate of the topic heat value has only a single peak. The method therefore performs poorly at predicting hot events and emergency events in practical applications.
Summary of the invention
The invention provides an emergency event detection and prediction method based on an autoencoder. It effectively overcomes the problem that, after news documents are converted into vectors, the document vectors are high-dimensional and sparse, and it adapts better to the practical situation in which the number of news texts varies over time, as described below:
An emergency event detection and prediction method based on an autoencoder, the method comprising the following steps:
performing Chinese word segmentation and stop-word removal on the data text; representing the processed data text as a text vector, and reducing the dimensionality of the text vector;
computing the similarity between the current text and each topic, sorting the similarities between the text and all topics in ascending order, and comparing the maximum similarity with a threshold to determine the topic to which the current text belongs or to create a new topic;
computing the topic heat value, and classifying as an emergency event any event for which the difference between the time window in which a news text belonging to a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold.
The topic heat value is specifically computed with an RD-based energy decay scheme.
Further, the method also comprises:
predicting emergency events based on the growth rate: when judging whether a topic is an emergency event, the time window in which the topic becomes a hot event is compared with the time window in which a news text belonging to the topic first appears.
The dimensionality reduction of the text vector is specifically:
a threshold R is added to the hidden layer; when the absolute value of the energy value of a loser neuron in the hidden layer is greater than the given threshold R, the neuron is regarded as useful for the learning of the autoencoder network;
when the activation value of a loser neuron is less than the given threshold R, its activation value is added to the winner neurons and the value of the loser neuron is set to zero.
Further, predicting emergency events based on the growth rate is specifically:
defining the growth rate curve of the topic heat value:
D_t = G[F(y_t)], F(y_t) ∈ [F(y_A), F(y_B)]
where A is the time point of maximum growth rate while the topic heat curve is in its growth phase, B is the time point at which the topic heat curve passes from growth to a plateau, and F(y_A) and F(y_B) are the topic heat values at points A and B respectively;
meanwhile, to address the fluctuation of the growth rate of the topic heat value caused by the time windows, the growth rate is smoothed using the following formula;
where D_t is the actual growth rate of the topic in time window t, and δ_i is the smoothing coefficient applied to the growth rate of the topic heat value at the corresponding time window.
Classifying as an emergency event any event for which the difference between the time window in which a news text belonging to a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold is specifically:
two thresholds T1 and T2 are set. If there is a topic whose similarity to the text is greater than T1, the current text is considered to belong to that topic and the topic center vector is updated. If the similarity is less than T2, no topic is considered to correspond to the current text and a new topic is created. If the similarity is less than T1 but greater than T2, the current text may belong to the current topic: the similarities between the current text and all news texts in that topic are computed and sorted, and if their maximum is greater than T1, the current text is considered to belong to the topic, is added to it, and the topic center is updated.
The beneficial effects of the technical solution provided by the invention are:
The invention uses the model's loss value during training to assess how well KAER reduces the dimensionality of text vectors, and compares it with KATE. The results are shown in Fig. 2: KAER performs better than KATE. After training, KAER's loss value falls from an initial 0.599 to 0.141, while KATE's falls from an initial 0.624 to 0.157. In terms of reconstructing the input vector, KAER reconstructs the model's input data better than KATE.
For emergency event detection, the invention applies the improved Single-Pass method to cluster the vectors after dimensionality reduction, and evaluates the result using precision, recall and F value. The experimental results show that the precision and recall of clustering on the reduced text vectors are clearly higher than on the vectors before reduction, which indicates that the improved autoencoder learns text features well while reducing the dimensionality of the text vectors.
The invention computes the topic heat value with the improved RD-based energy decay scheme and compares the RD-based scheme with the improved one; the curves of the heat value of the same topic over time windows are shown in Fig. 3 and Fig. 4 respectively. In the improved RD-based energy decay scheme, the nutrition decay factor changes dynamically with the number of news texts in the current time window. The scheme therefore fully accounts for the marked differences in the number of news texts between time windows, and the resulting topic heat curve is smoother and closer to how topic heat values actually change in practice.
The invention uses an improved growth-rate-based method for predicting hot events and emergency events. When judging whether a topic is an emergency event, the time window in which the topic becomes a hot event is compared with the time window in which a news text belonging to the topic first appears. The prediction results show that the improved growth-rate-based method predicts hot events and emergency events better.
Brief description of the drawings
Fig. 1 is a flowchart of an emergency event detection and prediction method based on an autoencoder;
Fig. 2 is a schematic comparison of the loss values of KATE (K-competitive autoencoder, K-Competitive Autoencoder) and KAER (improved K-competitive autoencoder model, K-Competitive Autoencoder with R-Threshold);
Fig. 3 is a schematic diagram of the topic heat curve under the energy decay scheme based on RD (recursive decay);
Fig. 4 is a schematic diagram of the topic heat curve under the energy decay scheme based on improved RD.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the invention clearer, embodiments of the invention are described in further detail below.
Embodiment 1
To solve the problems described in the background, an embodiment of the invention proposes an emergency event detection and prediction method based on an autoencoder. Referring to Fig. 1, the method includes the following steps:
101: performing Chinese word segmentation and stop-word removal on the data text;
102: representing the processed data text as a text vector, and reducing the dimensionality of the text vector;
103: computing the similarity between the current text and each topic, sorting the similarities between the text and all topics in ascending order, and comparing the maximum similarity with a threshold to determine the topic to which the current text belongs or to create a new topic;
104: computing the topic heat value, and classifying as an emergency event any event for which the difference between the time window in which a news text belonging to a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold;
105: measuring the accuracy of the proposed emergency event detection and prediction method by the loss function, recall, precision and F value.
In one embodiment, step 101 performs Chinese word segmentation and stop-word removal on the data text, as follows:
Given the characteristics of Chinese word segmentation, the embodiment uses Jieba segmentation to segment the text, tags each segmented word with its part of speech, and then filters out the stop words in the data using a stop-word dictionary.
In one embodiment, step 102 builds the text vector representation and performs dimensionality reduction on the basis of step 101, as follows:
For a news text that has been segmented, the text content is read in line by line and each word is checked against the dictionary. If the dictionary contains the word, the word's occurrence count in the current text is incremented by one; if not, the word is considered of low importance and is filtered out. This repeats until all words in the text have been read. Finally, each news text is represented as a vector composed of the dictionary index of each word and the word's occurrence count in the current text. Dimensionality reduction uses the improved K-competitive autoencoder model (K-Competitive Autoencoder with R-Threshold, KAER).
In one embodiment, step 103 computes topic similarity for the preprocessed news text, takes the maximum and compares it with a threshold, as follows:
A dictionary is first created from the corpus; the words in the text are then matched against the dictionary, the similarities of the matched words are summed, and the result is taken as the degree of correlation between two texts. The cosine similarity between a topic and the news text is computed as the similarity; the similarities are sorted in ascending order and the maximum is taken.
Whether the maximum similarity is greater than the specified threshold is then checked. If it is less than the threshold, the current text is considered not to belong to any current topic; a new topic is created and the current text is assigned to it. Otherwise, the current text is assigned to the matching topic and the topic's center vector is updated.
In one embodiment, step 104 computes the topic heat value and predicts emergency events, as follows:
The topic heat value is computed with the improved RD-based energy decay scheme (Recursive Decay with Dynamic Timeslot), and emergency events are predicted with the improved growth-rate-based prediction method: when judging whether a topic is an emergency event, the time window in which the topic becomes a hot event is compared with the time window in which a news text belonging to the topic first appears.
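The comparison in step 104 can be illustrated with a minimal Python sketch. It assumes time windows are numbered with integers; the function name, the `max_gap` parameter and the three labels are hypothetical and are not taken from the patent.

```python
from typing import Optional


def classify_topic(first_seen_window: int,
                   hot_window: Optional[int],
                   max_gap: int) -> str:
    """Label a topic from two time-window indices.

    first_seen_window: window in which a news text of the topic first appears.
    hot_window: window in which the topic is predicted to become hot,
                or None if it is never predicted to become hot.
    max_gap: the specified threshold on the difference between the two.
    """
    if hot_window is None:
        return "normal"  # hypothetical label for topics that never heat up
    # A topic that heats up in fewer than max_gap windows counts as an emergency.
    if hot_window - first_seen_window < max_gap:
        return "emergency"
    return "hot"
```

For example, with `max_gap=2`, a topic first seen in window 3 and predicted hot in window 4 would be labeled an emergency event, while one first seen in window 1 and hot in window 10 would be labeled an ordinary hot event.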
In one embodiment, step 105 measures the accuracy of the method using the loss function, recall, precision and F value, as follows:
The loss function is used to compute the degree of information loss caused by compression; the detection and prediction results are tallied against the actual results to compute recall, precision and F value, which measure the accuracy of the invention.
In summary, the embodiment of the invention effectively overcomes the problem that, after news documents are converted into vectors, the document vectors are high-dimensional and sparse, and it adapts better to the practical situation in which the number of news texts varies over time.
Embodiment 2
The scheme in embodiment 1 is further introduced below with reference to specific calculation formula, described below:
201: in analyzing emergency events, the text is first segmented and stop words are removed; Jieba segmentation is used in this process;
Jieba is used in its precise mode, which is applied to segmenting text in large volumes of data; in this mode each sentence is cut into words as accurately as possible.
202: each segmented word is tagged with its part of speech, and after tagging the stop words in the data are filtered out using a stop-word dictionary;
The part-of-speech tags are, for example, noun, verb and adjective.
203: the vector representation of the news text is built with a dictionary-based method;
After the dictionary is constructed, the words in it are sorted and each word is numbered. After the occurrence count of each word in the current news text is tallied, the text vector is normalized with a logarithmic function; the text vector is expressed as shown in formula (1).
where x_i is the vector form of the news text, V is the dictionary, and n_i is the number of occurrences of word i in the current news text.
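The dictionary-based representation can be sketched in Python as follows. Formula (1) is not reproduced in this text, so the normalization used here, log(1 + n_i) divided by the largest entry, is an assumption about what "normalized with a logarithmic function" means; the function names are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Sort the distinct words and number them, as in the dictionary step."""
    words = sorted({w for doc in tokenized_docs for w in doc})
    return {w: i for i, w in enumerate(words)}


def text_vector(tokens, vocab):
    """Count in-dictionary words, then log-normalize the counts.

    Assumption: each entry is log(1 + n_i) scaled by the maximum entry.
    """
    # Words that are not in the dictionary are filtered out.
    counts = Counter(t for t in tokens if t in vocab)
    vec = [0.0] * len(vocab)
    for word, n in counts.items():
        vec[vocab[word]] = math.log(1.0 + n)
    peak = max(vec, default=0.0)
    return [v / peak for v in vec] if peak > 0 else vec
```

With this scaling the most frequent word in a text always maps to 1.0, and a text containing no dictionary words maps to the all-zero vector.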
204: the improved K-competitive autoencoder model (K-Competitive Autoencoder with R-Threshold, KAER) is used to reduce the dimensionality of the text;
Text vectors are high-dimensional and sparse, so the embodiment uses the improved KAER for dimensionality reduction. A threshold R is added to the hidden layer. When the absolute value of the energy value of a loser neuron in the hidden layer is greater than the given threshold R, the neuron is regarded as useful for the learning of the autoencoder network and is left unprocessed. When the activation value of a loser neuron is less than the given threshold R, its activation value is added to the winner neurons and the value of the loser neuron is set to zero. In the embodiment, the threshold R is 0.1.
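The thresholded competition can be sketched in Python for a single hidden-layer activation vector. This is a simplified illustration, not the patented layer: it assumes the winners are the k neurons with the largest absolute activation, that the test against R uses the absolute value in both cases, and that the energy collected from weak losers is split equally among the winners. The function name and the parameter `k` are hypothetical.

```python
def kaer_competition(activations, k, r=0.1):
    """Apply a k-competitive step with an R threshold to one activation vector.

    Winners: the k neurons with the largest |activation| (assumption).
    Losers with |activation| > r are left untouched.
    Losers with |activation| < r are zeroed and their activation is
    moved to the winners (split equally, an assumption).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i]), reverse=True)
    winners, losers = order[:k], order[k:]
    out = list(activations)
    folded = 0.0
    for i in losers:
        if abs(out[i]) < r:
            folded += out[i]  # collect the weak loser's activation
            out[i] = 0.0      # and silence the loser
    for i in winners:
        out[i] += folded / len(winners)
    return out
```

Because weak activations are moved rather than discarded, the sum of the vector is preserved, while losers above the threshold keep contributing to the reconstruction.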
205: the similarity between a news text and a topic is computed with cosine similarity, as shown in formula (2).
cos(D_1, D_2) = (D_1 · D_2) / (||D_1|| ||D_2||) (2)
where D_1 and D_2 are the dictionary-based vector representations of the two texts being compared.
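As a minimal Python sketch, cosine similarity between two equal-length vectors can be computed as below. Returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero; the patent does not state how that case is handled.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm1 * norm2)
```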
206: topic detection on news texts uses the improved Single-Pass algorithm; the clusters are then ranked by the number of news texts under each topic, and the clustering result is assessed by manual labeling.
In the improved Single-Pass clustering method, two thresholds T1 and T2 are set. If there is a topic whose similarity to the text is greater than T1, the current text is considered to belong to that topic and the topic center vector is updated. If the similarity is less than T2, no topic is considered to correspond to the current text and a new topic is created. If the similarity is less than T1 but greater than T2, the current text may belong to the current topic: the similarities between the current text and all news texts in that topic are computed and sorted, and if their maximum is greater than T1, the current text is considered to belong to the topic, is added to it, and the topic center is updated.
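The two-threshold procedure can be sketched in Python as follows. Several details are assumptions rather than statements from the patent: the topic center is taken to be the mean of its member vectors, a borderline text whose best member-level similarity does not exceed T1 starts a new topic, and similarities exactly equal to a threshold fall into the lower band. All names are hypothetical.

```python
import math


def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def _centroid(members):
    n = len(members)
    return [sum(col) / n for col in zip(*members)]


def single_pass(vectors, t1, t2):
    """Two-threshold Single-Pass clustering.

    Returns a list of topics, each a list of indices into `vectors`.
    """
    topics = []  # each topic: {"center": [...], "members": [index, ...]}
    for idx, vec in enumerate(vectors):
        # Find the most similar existing topic.
        best, best_sim = None, -1.0
        for topic in topics:
            sim = _cos(vec, topic["center"])
            if sim > best_sim:
                best, best_sim = topic, sim
        join = False
        if best is not None and best_sim > t1:
            join = True
        elif best is not None and best_sim > t2:
            # Borderline case: compare against every text already in the topic.
            join = max(_cos(vec, vectors[m]) for m in best["members"]) > t1
        if join:
            best["members"].append(idx)
            best["center"] = _centroid([vectors[m] for m in best["members"]])
        else:
            topics.append({"center": list(vec), "members": [idx]})
    return [t["members"] for t in topics]
```

The result still depends on the order of the input texts, as with any Single-Pass method; the second threshold only softens the decision for texts that fall between T2 and T1.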
207: in computing the heat value, the energy value represents the development state of a topic, and changes in the energy value are used to predict the topic's likely life cycle;
For any topic V in the t-th time window, let x_t denote the sum of the similarities between the topic and all texts belonging to it within the specified time. At time t, the energy value of the topic is given by formula (3).
y_t = g(x_1, ..., x_t, α, β) (3)
where x_i is the sum of the similarities between a given topic and the news texts belonging to it in the i-th time window; α and β are the two parameters of the life cycle model, with 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1. α is the nutrition transfer factor and determines the proportion of x_i that contributes to the topic's nutrition value; β is the nutrition decay factor and determines how the topic's energy diminishes over time in each time window.
In real news publishing, the number of news items is affected by time. On the one hand, clearly more news is published on weekdays than at weekends; on the other hand, clearly more news is published during special events, such as the World Cup or a disaster somewhere, than in ordinary periods. The energy value of a topic is directly related to the number of news texts belonging to it, yet the RD-based energy decay scheme does not account for the real-world influence of time on topic heat. The invention therefore proposes a dynamic nutrition decay factor that fully accounts for the influence of the time interval on topic heat, as shown in formula (4):
β_i = β * log(1.0 + n_i / avg) (4)
where n_i is the number of news texts in the i-th time window, avg is the average number of news texts published by the news website per time window, β_i is the nutrition decay factor in the i-th time slice, and β is the nutrition decay factor calculated in the RD-based energy decay scheme.
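Formula (4) can be evaluated directly, as in the following Python sketch. The patent does not state the base of the logarithm, so the natural logarithm is assumed here; the function name is hypothetical.

```python
import math


def dynamic_decay(beta, n_i, avg):
    """Formula (4): beta_i = beta * log(1.0 + n_i / avg).

    beta: base nutrition decay factor from the RD-based scheme.
    n_i:  number of news texts in the i-th time window.
    avg:  average number of news texts per time window (must be > 0).
    Assumption: log is the natural logarithm.
    """
    return beta * math.log(1.0 + n_i / avg)


# Decay factors for a quiet window, an average window and a busy window.
factors = [dynamic_decay(0.5, n, avg=100) for n in (20, 100, 300)]
```

Under this formula the decay factor grows with the number of texts in the window, so busy windows decay faster than quiet ones, and a window with no texts yields a factor of zero.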
An energy function F(y_t) is defined to compute the heat of a topic. Its argument is the topic's energy value, and it must satisfy the following conditions, as shown in formula (5):
where y_t is the energy value of the topic at time t, and F(y_t) normalizes the topic energy.
The energy function is computed as shown in formula (6).
F(r·y_t) = s, r > 0, s < 1 (6)
where y_t is the energy value of the topic at time t. The formula can be read as follows: when the texts belonging to a topic account for a proportion r, the topic heat value returned by the energy function is s.
208: using the improved growth-rate-based prediction method (Improved method of Forecasting based on rate), the growth rate curve of the topic heat value is defined as shown in formula (7).
D_t = G[F(y_t)], F(y_t) ∈ [F(y_A), F(y_B)] (7)
where A is the time point of maximum growth rate while the topic heat curve is in its growth phase, B is the time point at which the topic heat curve passes from growth to a plateau, and F(y_A) and F(y_B) are the topic heat values at points A and B respectively.
Meanwhile, to address the fluctuation of the growth rate of the topic heat value caused by the time windows, the growth rate is smoothed using formula (8).
where D_t is the actual growth rate of the topic in time window t; δ is a set of empirical values, [32, 24, 16, 8, 4]; and δ_i is the smoothing coefficient applied to the growth rate of the topic heat value at the corresponding time window.
209: the loss function is used to compute the degree of information loss caused by compression; the detection and prediction results are tallied against the actual results to obtain recall, precision and F value, which measure the accuracy of the method.
In summary, the embodiment of the invention designs an autoencoder-based emergency event detection and prediction method that uses the neurons in the hidden layer of the autoencoder model to represent the text vector after dimensionality reduction, and uses the improved energy decay scheme (formula 4) so that it adapts better to the practical situation in which the number of news texts varies over time. News information already available on the network can thus be used to predict hot events and emergency events in advance, leaving sufficient preparation time to better maintain social stability.
Embodiment 3
Feasibility verifying is carried out to the scheme in Examples 1 and 2 below with reference to Fig. 2-Fig. 4, described below:
To assess whether the word representations captured by KAER are semantically meaningful, the experiment checks whether similar or related words lie close to each other in the vector space model [1]. For a word of a given topic, its index in the dictionary is looked up, the weight vector corresponding to that word in the model is retrieved by the index, and words similar or related to the specified word are then found as those whose weight vectors are close to it. In terms of word similarity, KAER learns words of higher similarity than KATE does.
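The lookup described above (dictionary index, then weight vector, then nearby vectors) can be sketched as follows. This is a minimal illustration, not the patent's implementation: cosine similarity is assumed as the closeness measure, and the vocabulary, the weight values, and the name `nearest_words` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_words(word: str, vocab: Dict[str, int],
                  weights: List[List[float]], top_n: int = 3) -> List[str]:
    """Return the top_n words whose weight vectors are closest to `word`."""
    idx = vocab[word]                 # index of the word in the dictionary
    target = weights[idx]             # its weight vector in the model
    index_to_word = {i: w for w, i in vocab.items()}
    scored = [(cosine(target, row), index_to_word[i])
              for i, row in enumerate(weights) if i != idx]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [w for _, w in scored[:top_n]]


# Toy dictionary and weight matrix (one row per word), for illustration only.
vocab = {"flood": 0, "rainstorm": 1, "listing": 2, "stock": 3}
weights = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(nearest_words("flood", vocab, weights, top_n=1))
```

In a trained model, each row of `weights` would come from the encoder weights connecting one dictionary entry to the hidden layer.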
To assess the difference between the text vector reconstructed by KAER and the input vector, the experiment compares how the model's loss value changes during training. Given an input vector x_i, the loss between the model's output vector and x_i is computed as shown in formula (9).
Here, V is the number of text vectors, and the output vector is the reconstruction of text vector x_i produced by the improved autoencoder. In terms of reconstructing the input vector, KAER reconstructs the model's input data better than KATE does.
Meanwhile the embodiment of the present invention uses recall rate (Recall), accuracy rate (Precision) and F value (F-
Measure) evaluation criterion as detection scoring, accuracy rate can be understood as the institute that correctly predicted sample size accounts for prediction
There is the ratio of sample size, be mainly used to the accuracy of measure algorithm, as shown in formula (10).Recall rate can be understood as predicting
Related text quantity and corpus in related text quantity ratio, as shown in formula (11).F value can be understood as accuracy
With the harmonic mean of recall rate, as shown in formula (12).
P = m / n    (10)
where m is the number of texts in a given topic that the clustering algorithm detects correctly, and n is the number of texts that the clustering algorithm assigns to that topic.
R = m / t    (11)
where m is the number of texts in a given topic that the clustering algorithm detects correctly, and t is the number of texts in the specified text set that actually belong to that topic.
F = 2PR / (P + R)    (12)
where P is precision and R is recall.
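As a worked illustration of these three measures, the sketch below computes precision as correct over predicted, recall as correct over actual, and the F-measure as their harmonic mean. The function name `precision_recall_f` and the sample counts are hypothetical; returning zero when a denominator is zero is a convention chosen here, not something the text specifies.

```python
from typing import Tuple


def precision_recall_f(m: int, n: int, t: int) -> Tuple[float, float, float]:
    """Per-topic evaluation.

    m: texts of the topic that the clustering detected correctly
    n: texts that the clustering assigned to the topic
    t: texts in the text set that actually belong to the topic
    """
    p = m / n if n else 0.0                      # precision
    r = m / t if t else 0.0                      # recall
    f = 2 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean of p and r
    return p, r, f


# Hypothetical counts: 50 texts assigned to the topic, 40 of them correct,
# and 80 texts that truly belong to it.
p, r, f = precision_recall_f(m=40, n=50, t=80)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

With these counts, precision is high while recall is lower, so the F-measure falls between the two, closer to the smaller value.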
The experimental results show that, for the clustering results, precision and recall on the dimension-reduced text vectors are clearly higher than on the text vectors before dimensionality reduction. This indicates that the improved autoencoder, while improving algorithm efficiency by reducing the dimension of the text vectors, also learns the text features better. For example, for the Xiaomi listing topic, the listing attracted wide attention within a short period, so both precision and recall for this topic are high. For the topic of the 2018 Yangtze River No. 1 flood, rainfall was heavy in many regions of the country during that period, so words such as "heavy rain", "flood", and "casualties" appeared frequently and reduced the weight of the regional keywords. The precision of the clustering result is therefore lower, while recall is higher because the model treats all of the similar texts as the same topic.
As for the assessment of the improved growth-rate-based prediction method, the experimental results and analysis together show that when both the time an event appears and its subsequent development fall within the time range of the selected data set, whether it is a hot event or an emergency event can be judged from its state of development. If, however, the event had already developed into a hot event before the start of the selected data set, the model's prediction of the emergency status of such an event is not very satisfactory. In summary, the improved growth-rate-based prediction method predicts hot events and emergency events fairly well.
In Fig. 2, KAER performs better than KATE in the experiment. After training, the loss of KAER drops from an initial 0.599 to 0.141, while the loss of KATE drops from 0.624 to 0.157. Therefore, in terms of reconstructing the input vector, KAER reconstructs the model's input data better than KATE does.
In Fig. 3, under the RD-based energy decay scheme, training yields a nutrition decay factor of 0.0024. Because the nutrition decay factor is a fixed value, the scheme does not take into account that the number of news texts differs markedly between time periods. As a result, the topic heat curve changes abruptly and contains many sawtooth fluctuations, which is unfavorable for the subsequent growth-rate prediction method based on the topic heat curve.
In Fig. 4, under the improved RD-based energy decay scheme, the nutrition decay factor changes dynamically with the number of news texts in the current time window. The scheme can therefore fully account for the marked differences in the number of news texts across time windows. The topic heat curve is smoother and so agrees better with how the topic heat value changes in practice.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments of the invention are for description only and do not indicate the relative merits of the embodiments.
The foregoing is merely a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (6)
1. An emergency event detection and prediction method based on an autoencoder, characterized in that the method comprises the following steps:
performing Chinese word segmentation and stop-word removal on the data text; representing the processed data text as text vectors, and performing a dimensionality reduction operation on the text vectors;
calculating the similarity between the current text and each topic, sorting the similarities between the text and all topics in ascending order, and comparing the maximum similarity with a threshold to determine the topic to which the current text belongs or to create a new topic;
calculating the topic heat value, and classifying as an emergency event any event for which the difference between the time window in which a news text belonging to the topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold.
2. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that calculating the topic heat value specifically comprises: calculating the topic heat value based on RD energy decay.
3. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that the method further comprises:
predicting emergency events based on the growth rate, and, when judging whether a given topic is an emergency event, comparing the time window in which the topic becomes a hot event with the time window in which a news text belonging to the topic first appears.
4. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that performing the dimensionality reduction operation on the text vectors specifically comprises:
adding a threshold R to the hidden layer; when the absolute value of the energy value of a losing node in the hidden layer is greater than the given threshold R, the node is regarded as useful for the learning of the autoencoder network;
when the activation value of a losing neuron is less than the given threshold R, the activation value of the losing neuron is added to the winning neurons, and the value of the losing neuron is set to zero.
5. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that predicting emergency events based on the growth rate specifically comprises:
defining the growth rate curve of the topic heat value:
D_t = G[F(y_t)], F(y_t) ∈ [F(y_A), F(y_B)]
where A is the time point of maximum growth rate while the topic heat curve is in its growth phase, B is the time point at which the topic heat curve levels off after growth, and F(y_A) and F(y_B) denote the topic heat values at points A and B, respectively;
meanwhile, to counter the fluctuations in the growth rate of the topic heat value caused by the time window, smoothing the growth rate using the following formula;
where D_t is the actual growth rate of the topic in time window t, and δ_i is the smoothing coefficient corresponding to the growth rate of the topic heat value at the corresponding time window.
6. The emergency event detection and prediction method based on an autoencoder according to claim 1, characterized in that classifying as an emergency event any event for which the difference between the time window in which a news text belonging to a topic first appears and the time window in which the topic is predicted to become a hot topic is less than a specified threshold specifically comprises:
setting two thresholds T1 and T2; if there is a topic whose similarity to the text is greater than T1, the current text is considered to belong to that topic and the topic center vector is updated; if the similarity is less than T2, no topic is considered to correspond to the current text, and a new topic is created; if the similarity is less than T1 but greater than T2, the current text is considered likely to belong to the current topic, and the similarities between the current text and all news texts in that topic are compared and sorted; if the maximum of these is greater than T1, the current text is considered to belong to the topic, the text is added to the topic, and the topic center is updated.
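The two-threshold topic assignment recited in claim 6 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: cosine similarity is assumed as the similarity measure, the topic center is assumed to be the mean of the member vectors, a borderline text that fails the second check is assumed to start a new topic, and the names `Topic` and `assign` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


@dataclass
class Topic:
    center: List[float]
    texts: List[List[float]] = field(default_factory=list)

    def add(self, vec: Sequence[float]) -> None:
        """Add a text and update the center (assumed: mean of the members)."""
        self.texts.append(list(vec))
        count = len(self.texts)
        self.center = [sum(col) / count for col in zip(*self.texts)]


def assign(vec: Sequence[float], topics: List[Topic],
           t1: float, t2: float) -> int:
    """Place `vec` in a topic and return that topic's index.

    Similarities exactly equal to T1 or T2 are not covered by the claim;
    here they fall through to the lower branch.
    """
    if topics:
        sims = [cosine(vec, topic.center) for topic in topics]
        best = max(range(len(topics)), key=sims.__getitem__)
        if sims[best] > t1:
            topics[best].add(vec)
            return best
        if sims[best] > t2:
            # Borderline case: compare against every text in the topic.
            if max(cosine(vec, text) for text in topics[best].texts) > t1:
                topics[best].add(vec)
                return best
    # No matching topic (or failed borderline check, by assumption).
    topics.append(Topic(center=list(vec), texts=[list(vec)]))
    return len(topics) - 1


# Hypothetical two-dimensional text vectors.
topics: List[Topic] = []
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.5]]
labels = [assign(doc, topics, t1=0.9, t2=0.5) for doc in docs]
print(labels)
```

The second comparison against individual texts matters when a topic's center has drifted away from some of its members: a text can be close to one member without being close to the center.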
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910401627.0A CN110209813A (en) | 2019-05-14 | 2019-05-14 | A kind of incident detection and prediction technique based on autocoder |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910401627.0A CN110209813A (en) | 2019-05-14 | 2019-05-14 | A kind of incident detection and prediction technique based on autocoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110209813A true CN110209813A (en) | 2019-09-06 |
Family
ID=67787220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910401627.0A Pending CN110209813A (en) | 2019-05-14 | 2019-05-14 | A kind of incident detection and prediction technique based on autocoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110209813A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102937960A (en) * | 2012-09-06 | 2013-02-20 | 北京邮电大学 | Device and method for identifying and evaluating emergency hot topic |
CN105718598A (en) * | 2016-03-07 | 2016-06-29 | 天津大学 | AT based time model construction method and network emergency early warning method |
WO2018086518A1 (en) * | 2016-11-08 | 2018-05-17 | 北京国双科技有限公司 | Method and device for real-time detection of new subject |
CN108805167A (en) * | 2018-05-04 | 2018-11-13 | 江南大学 | L aplace function constraint-based sparse depth confidence network image classification method |
CN108932311A (en) * | 2018-06-20 | 2018-12-04 | 天津大学 | The method of incident detection and prediction |
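Non-Patent Citations (2)
Title |
---|
Sun Hongguang et al.: "Network news topic discovery based on an improved Single-Pass algorithm", Journal of Jilin University (Science Edition) * |
Lin Yuwang: "Research on emergency event detection and prediction methods", China Master's Theses Full-text Database, Information Science and Technology * |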
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113158079A (en) * | 2021-04-22 | 2021-07-23 | 昆明理工大学 | Case public opinion timeline generation method based on difference case elements |
CN113987192A (en) * | 2021-12-28 | 2022-01-28 | 中国电子科技网络信息安全有限公司 | Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm |
CN113987192B (en) * | 2021-12-28 | 2022-04-01 | 中国电子科技网络信息安全有限公司 | Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190906 |