Embodiment
The embodiments of the present application provide a method and a device for early warning of abnormal user viewpoints.
To help those skilled in the art better understand the technical solutions of the application, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments, without creative work, fall within the scope of protection of the application.
Fig. 1 is a schematic flowchart of the method for early warning of abnormal user viewpoints proposed in an embodiment of the application. In the embodiment shown in the figure, user documents relating to preset content are clustered, the user viewpoint expressed by each cluster topic is extracted, and the user viewpoints are analyzed so that an early warning can be issued for abnormal user viewpoints that grow rapidly. As shown in Fig. 1, the method includes:
Step 101, user documents that meet a preset condition are obtained.
Specifically, user documents can be obtained in many ways: for example, they can be obtained from web pages, crawled from preset websites, extracted from a known database, or obtained from the records of a preset program. The preset condition can be relevance to a specific event, product, or the like, or the presence of preset words or sentences. For example, user messages and reposted comments can be crawled from microblog pages that involve the preset content or related words, or user comments, messages, feedback, and complaints can be obtained directly from user feedback records collected through internal channels.
Step 102, the user documents are clustered. The similarity between user documents can be calculated and the documents clustered by an existing clustering algorithm.
Step 103, the user viewpoint of each cluster topic is extracted. The user viewpoint expressed by a document group can be extracted from the keywords in the document group obtained by clustering, as described in detail in a subsequent embodiment.
Step 104, an early warning is issued according to the number of user documents of the user viewpoint within a preset time.
According to one embodiment of the application, clustering the user documents includes: extracting user intention features from the user documents; performing document similarity analysis on the user intention features; and clustering the user documents according to the result of the document similarity analysis. Specifically, different clustering algorithms essentially all cluster by means of some similarity measure. The application can use a variety of clustering methods, preferably a streaming clustering method, that is, a clustering algorithm based on online learning such as the Single Pass algorithm, which clusters user documents in real time in chronological order. By extracting the features that best express user intention and using them as the basis for similarity analysis and clustering, the user intentions expressed by the documents in each resulting group are as close as possible, which improves both the accuracy and the efficiency of clustering.
According to one embodiment of the application, the user intention features include a dependency feature, a text feature, a verb feature, and a user behavior feature. The dependency feature describes the dependency relations between words. In dependency syntax, each sentence is organized around one most critical word, and this word can be used to represent the intention of the user. Specifically, for each user document, dependency feature extraction can be performed to obtain the dependency feature, text preprocessing can be performed to obtain the text feature, the verbs in the document can be extracted to obtain the verb feature, and the user's behaviors related to the preset content can be extracted and screened to obtain the user behavior feature. Extracting these user intention features makes the extracted features more effective, which improves the effect and accuracy of the clustering algorithm.
According to one embodiment of the application, extracting the user viewpoint of the cluster topic includes: sorting the words in the user documents of the cluster topic by word frequency; and extracting the user viewpoint of the cluster topic according to the word frequency ranking. All user documents in a cluster topic can be sorted by word frequency, and the several keywords with the highest frequency are selected. The word order of these keywords is then determined from the positions at which they occur in each document, and the user viewpoint of the cluster topic is finally extracted.
According to one embodiment of the application, issuing an early warning according to the number of user documents of the user viewpoint within the preset time includes: collecting document quantity information of the user viewpoint within the preset time; calculating, from the document quantity information, the mean number of documents of the user viewpoint within the preset time; and issuing an abnormal viewpoint warning when the distance between the number of newly added documents of the user viewpoint and the mean number of documents exceeds a first preset threshold. The document quantity information can be one or more quantity statistics such as the number of documents added for the user viewpoint within the preset time, the increase per unit time, the mean quantity within the preset time, and the growth rate. The preset time can be set according to the statistical requirement. For example, to monitor the number of documents newly added for a user viewpoint in one day, the document data of the most recent 30 days can be used to calculate the mean number of documents belonging to that user viewpoint per day. Whether the number of newly added documents is abnormal is judged against the mean number of user documents of the viewpoint over the preset period, so that a warning can be issued when this quantitative anomaly is found. This embodiment can be realized by a method based on the rbf kernel, as described in detail in a subsequent embodiment.
According to one embodiment of the application, issuing an early warning according to the number of user documents of the user viewpoint within the preset time includes: collecting document quantity information of the user viewpoint within the preset time; predicting, from the document quantity information, the number of newly added documents of the user viewpoint to obtain a predicted quantity; and issuing an abnormal viewpoint warning when the difference between the number of newly added documents and the predicted quantity exceeds a second preset threshold. Specifically, the prediction can be made by a method based on time series. Time series forecasting is a commonly used way of predicting future quantities, and a common time series forecasting method is the ARIMA method, which predicts the future from historical information. The predicted number of documents for the current day can be calculated from the historical document quantities (for example, the daily number of documents over the previous thirty days). If the number of documents in the cluster topic is much larger than the historical quantity, an alarm is raised. The application of the ARIMA method to quantity prediction based on time series is described in the related technical literature, for example "Time Series Forecasting Techniques, Part 3: ARIMA Model Forecasting with Independent Variables" (Shen Hao, 2009-12-02), and is not repeated here.
According to one embodiment of the application, issuing an early warning according to the number of user documents of the user viewpoint within the preset time includes: collecting document quantity information of the user viewpoint within the preset time; calculating, from the document quantity information, the mean number of documents of the user viewpoint within the preset time; predicting, from the document quantity information, the number of newly added documents of the user viewpoint to obtain a predicted quantity; and issuing an abnormal viewpoint warning when the distance between the number of newly added documents of the user viewpoint and the mean number of documents exceeds the first preset threshold and the difference between the number of newly added documents and the predicted quantity exceeds the second preset threshold. This embodiment combines the two warning conditions and issues a warning for the user viewpoint only when both conditions hold at the same time, which can effectively reduce the probability of false alarms and improve the correctness of the warning.
According to the embodiments of the application, user documents are clustered and the user viewpoint expressed by each cluster topic is extracted. By analyzing the number of user documents of a given user viewpoint within a preset time, the growth rate of the number of documents of each user viewpoint can be monitored in real time and a warning issued when the data are abnormal. This helps to detect large-scale surges of user viewpoints in time, especially surges of negative viewpoints, so that an enterprise can respond quickly once a problem is found, effectively prevent the situation from deteriorating, and take the initiative in solving the problem.
Based on the same inventive concept, the embodiments of the application also provide a device for early warning of abnormal user viewpoints, which can be used to implement the method described in the above embodiments, as described in the following embodiments. Because the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again. As used below, the term "unit" or "module" may be a combination of software and/or hardware that realizes a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 2 is a schematic structural diagram of the device for early warning of abnormal user viewpoints according to an embodiment of the application. The device of this embodiment may be composed of logic components that realize the corresponding functions, or may be an electronic device running software with the corresponding functions. As shown in Fig. 2, the device includes: an acquisition module 100, a clustering module 200, an extraction module 300, and a warning module 400.
Specifically, the acquisition module 100 is used to obtain user documents that meet a preset condition.
The clustering module 200 is used to cluster the user documents.
The extraction module 300 is used to extract the user viewpoint of each cluster topic.
The warning module 400 is used to issue an early warning according to the number of user documents of the user viewpoint within a preset time.
Fig. 3 is a schematic structural diagram of the device for early warning of abnormal user viewpoints according to another embodiment of the application.
According to one embodiment of the application, as shown in Fig. 3, the clustering module 200 includes an extraction submodule 210, a similarity analysis submodule 220, and a clustering submodule 230.
Specifically, the extraction submodule 210 is used to extract user intention features from the user documents;
the similarity analysis submodule 220 is used to perform document similarity analysis on the user intention features;
the clustering submodule 230 is used to cluster the user documents according to the result of the document similarity analysis.
According to one embodiment of the application, the extraction submodule 210 is specifically used to extract the dependency feature, text feature, verb feature, and user behavior feature from the documents.
According to one embodiment of the application, as shown in Fig. 3, the extraction module 300 can include a word frequency sorting submodule 310 and a viewpoint extraction submodule 320. The word frequency sorting submodule 310 is used to sort the words in the user documents of the cluster topic by word frequency; the viewpoint extraction submodule 320 is used to extract the user viewpoint of the cluster topic according to the word frequency ranking.
According to one embodiment of the application, as shown in Fig. 4, the warning module 400 can include a statistics submodule 410, a calculation submodule 420, and a first warning submodule 430. The statistics submodule 410 is used to collect document quantity information of the user viewpoint within the preset time; the calculation submodule 420 is used to calculate, from the document quantity information, the mean number of documents of the user viewpoint within the preset time; the first warning submodule 430 is used to issue an abnormal viewpoint warning when the distance between the number of newly added documents of the user viewpoint and the mean number of documents exceeds the first preset threshold.
According to one embodiment of the application, as shown in Fig. 5, the warning module 400 can include a statistics submodule 410, a prediction submodule 440, and a second warning submodule 450. The statistics submodule 410 is used to collect document quantity information of the user viewpoint within the preset time; the prediction submodule 440 is used to predict, from the document quantity information, the number of newly added documents of the user viewpoint to obtain a predicted quantity; the second warning submodule 450 is used to issue an abnormal viewpoint warning when the difference between the number of newly added documents and the predicted quantity exceeds the second preset threshold.
According to one embodiment of the application, as shown in Fig. 6, the warning module 400 can include a statistics submodule 410, a calculation submodule 420, a prediction submodule 440, and a third warning submodule 460. The third warning submodule 460 is used to issue an abnormal viewpoint warning when the distance between the number of newly added documents of the user viewpoint and the mean number of documents exceeds the first preset threshold and the difference between the number of newly added documents and the predicted quantity exceeds the second preset threshold.
According to the embodiments of the application, the device clusters user documents, extracts the user viewpoint expressed by each cluster topic, and analyzes the number of user documents of a given user viewpoint within a preset time, so that the growth rate of the number of documents of each user viewpoint is monitored in real time and a warning is issued when the data are abnormal. This helps to detect large-scale surges of user viewpoints in time, especially surges of negative viewpoints, so that an enterprise can respond quickly once a problem is found, effectively prevent the situation from deteriorating, and take the initiative in solving the problem.
Fig. 7 is a schematic flowchart of early warning of abnormal user viewpoints using the above method and device, according to a specific embodiment of the application:
Step 1, user documents that meet a preset condition are obtained.
Specifically, user documents can be obtained in many ways: for example, they can be obtained from web pages, crawled from preset websites, extracted from a known database, or obtained from the records of a preset program. The preset condition can be relevance to a specific event, product, or the like, or the presence of preset words or sentences. For example, user messages and reposted comments can be crawled from microblog pages that involve the preset content or related words, or user comments, messages, feedback, and complaints can be obtained directly from user feedback records collected through internal channels. As a concrete example, comments related to "Ant Huabei" can be crawled from the official microblog of Alibaba.
Step 2, the dependency feature in the user documents is extracted.
Specifically, the dependency feature describes the dependency relations between the words in a sentence. In dependency syntax, each sentence is organized around one most critical word, and this word can be used to represent the intention of the user. The dependency feature in a user document can be extracted with an existing dependency parsing algorithm.
Step 3, the text feature in the user documents is extracted.
Specifically, conventional preprocessing can be applied to the text of the user documents. Because the text of user documents used for early warning analysis is mostly short dialogue, word segmentation is usually unnecessary; instead, the text is preprocessed by 2-gram splitting (a common segmentation method that does not rely on a dictionary and splits a sentence into overlapping units of two characters; for example, 花呗手续费, "Huabei service fee", is split into 花呗, 呗手, 手续, 续费). After the 2-gram preprocessing, the text is converted into a vector by a text vector space model.
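The 2-gram preprocessing and vectorization of step 3 can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names `char_bigrams` and `to_vector`, the sample texts, and the use of raw 2-gram counts as the vector space model are all assumptions made for the example.

```python
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping two-character units (2-grams), ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def to_vector(text, vocabulary):
    """Map text to a count vector over a fixed 2-gram vocabulary (assumed weighting)."""
    counts = Counter(char_bigrams(text))
    return [counts[gram] for gram in vocabulary]


# Hypothetical sample documents.
docs = ["花呗手续费", "花呗不能开通"]
vocabulary = sorted({gram for doc in docs for gram in char_bigrams(doc)})
vectors = [to_vector(doc, vocabulary) for doc in docs]
```

Other weightings, such as TF-IDF, could replace the raw counts without changing the rest of the pipeline.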
Step 4, the verb feature in the user documents is extracted.
In general, the verb is the most important word in a sentence and best represents the intention of the user. Extracting the verbs that represent user intention from the sentences therefore also yields an accurate expression of the user intention feature.
Step 5, the user behavior feature in the user documents is extracted.
Specifically, the user features related to the preset condition can be extracted. Selecting suitable user features is of great significance for improving classification accuracy. At present, user behavior features are selected mainly on the basis of business experience. For example, if the preset condition is the product "Ant Huabei", the extracted features can include whether the user has activated the product, the user's most recent login address, the user's most recent IP address, and so on.
Step 6, document similarity analysis is performed on the user intention features.
The user intention features include the dependency feature, text feature, verb feature, and user behavior feature described above.
Specifically, a classical clustering algorithm generally has a similarity measurement formula. In this embodiment, a similarity measurement formula based on cosine distance is taken as an example. The formula is as follows:
$$\mathrm{sim}(doc_1, doc_2) = \alpha \cos(text_1, text_2) + \beta \cos(dep_1, dep_2) + \gamma \cos(verb_1, verb_2) + \theta \cos(beh_1, beh_2)$$
$$\alpha + \beta + \gamma + \theta = 1$$
Here $doc_1$ and $doc_2$ denote two user documents; $text_1$ and $text_2$ are the text feature parts of $doc_1$ and $doc_2$ respectively; $dep_1$ and $dep_2$ are their dependency syntax feature parts; $verb_1$ and $verb_2$ are their verb feature parts; $beh_1$ and $beh_2$ are their user behavior feature parts; $\cos(\cdot)$ denotes similarity measured by the cosine value; and $\alpha$, $\beta$, $\gamma$, $\theta$ are the corresponding weights. Following the general convention, the similarity usually lies between 0 and 1, which requires $\alpha$, $\beta$, $\gamma$, $\theta$ to sum to 1. In general, the closer the similarity is to 1, the closer the two documents are; the closer the similarity is to 0, the less similar the two documents are, that is, the greater the difference in the meaning they express.
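The weighted similarity of step 6 can be sketched as below. The sketch assumes that every one of the four feature parts is already a numeric vector and that each part is compared by cosine similarity; the weight values, the dictionary keys, and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


FEATURES = ("text", "dep", "verb", "beh")


def doc_similarity(doc1, doc2, weights):
    """Weighted sum of per-feature cosine similarities; weights must sum to 1."""
    if abs(sum(weights[f] for f in FEATURES) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[f] * cosine(doc1[f], doc2[f]) for f in FEATURES)


# Hypothetical weights (alpha, beta, gamma, theta) and feature vectors.
weights = {"text": 0.4, "dep": 0.3, "verb": 0.2, "beh": 0.1}
d1 = {"text": [1, 0, 2], "dep": [1, 1], "verb": [0, 1], "beh": [1, 0, 0]}
d2 = {"text": [0, 3, 0], "dep": [1, -1], "verb": [1, 0], "beh": [0, 1, 0]}
```

With non-negative feature vectors and weights that sum to 1, the result stays between 0 and 1, which matches the range the text expects.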
It should be understood that, in addition to the above four kinds of features, other user intention features can also be used, and the similarity measurement formula changes accordingly. The four kinds of features selected in this embodiment make the extracted features more effective, which improves the effect and accuracy of the clustering algorithm.
Step 7, the user documents are clustered according to the result of the document similarity analysis.
For example, with a clustering algorithm based on online learning, the user documents can be clustered in real time in chronological order.
First, some hyperparameters of the algorithm need to be specified: t1 is the upper similarity threshold and t2 is the lower similarity threshold. Both t1 and t2 take values between 0 and 1.
Specifically, at the beginning the number of cluster topics is 0, that is, no user document belongs to any cluster topic. For each user document that flows in, in chronological order, the user intention features described above are extracted to obtain one large vector. The centroid of the document group of each cluster topic is then calculated, and the similarity between the newly arrived user document and the centroid of each cluster topic is calculated. If the similarity to some centroid is greater than t1, the user document is assigned to that cluster topic. If the similarity to every centroid is less than t2, the user document forms an independent topic of its own. If the similarity lies between t1 and t2, the topic of the user document is hard to determine, and the document can be discarded.
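The streaming procedure of step 7 can be sketched as a single-pass loop. This is an illustrative sketch under assumptions: documents are represented by one plain numeric vector each, plain cosine similarity stands in for the weighted similarity formula, ties at exactly t1 or t2 are treated as ambiguous, and the names `single_pass` and `centroid` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def single_pass(vectors, t1, t2):
    """Cluster vectors in arrival order.

    Returns (members, discarded): members[i] lists the document indices in
    cluster topic i; discarded lists the indices of ambiguous documents.
    """
    clusters, members, discarded = [], [], []
    for index, vec in enumerate(vectors):
        best_sim, best = -1.0, None
        for c, cluster in enumerate(clusters):
            sim = cosine(vec, centroid(cluster))
            if sim > best_sim:
                best_sim, best = sim, c
        if best is not None and best_sim > t1:
            clusters[best].append(vec)      # similar enough: join the topic
            members[best].append(index)
        elif best is None or best_sim < t2:
            clusters.append([vec])          # unlike every topic: start a new one
            members.append([index])
        else:
            discarded.append(index)         # between t2 and t1: ambiguous
    return members, discarded
```

Recomputing every centroid for each arriving document keeps the sketch short; a practical version would more likely maintain running sums per cluster topic.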
Step 8, the words in the user documents of the cluster topic are sorted by word frequency.
Specifically, in order to present viewpoints well, a simple viewpoint extraction method can be chosen. For example, the word frequencies over all user documents in each cluster topic can be counted, and the words in each topic sorted by frequency. The top 10 words are then selected as the high-frequency words of the cluster topic.
Step 9, the user viewpoint of the cluster topic is extracted according to the word frequency ranking.
Specifically, the position at which each selected high-frequency word occurs in each user document can be recorded and the mean position calculated. The high-frequency words are then ordered by mean position to obtain their word order, and the user viewpoint of the cluster topic is finally extracted. For example, suppose the high-frequency words obtained by word frequency screening are "Huabei", "activate", and "cannot". Each of these three words is located in the original documents to obtain its position value. If, for instance, the two keywords "Huabei" and "cannot" occur in that order in a user document, the position value of "Huabei" in that document is 1 and the position value of "cannot" is 2. Proceeding in the same way gives the position value of each high-frequency word in each user document of the cluster topic. If averaging the position values gives a mean position of 1.3 for "Huabei", 3.5 for "activate", and 2.3 for "cannot", ordering by mean position yields the viewpoint "Huabei cannot be activated".
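Steps 8 and 9 can be sketched as follows. The sketch assumes that each user document has already been split into a list of words, and that a word's position in a document is its rank among the high-frequency words appearing there, counted from 1 at first occurrence; the function name `extract_viewpoint` and the sample documents are hypothetical.

```python
from collections import Counter


def extract_viewpoint(docs, top_k=10):
    """Return the top_k most frequent words of a cluster topic, ordered by mean position."""
    freq = Counter(word for doc in docs for word in doc)
    high = [word for word, _ in freq.most_common(top_k)]
    positions = {word: [] for word in high}
    for doc in docs:
        seen = set()
        kept = [word for word in doc if word in positions]
        for pos, word in enumerate(kept, start=1):
            if word not in seen:            # record the first occurrence only
                positions[word].append(pos)
                seen.add(word)
    mean_pos = {word: sum(p) / len(p) for word, p in positions.items()}
    return sorted(high, key=lambda word: mean_pos[word])


# Hypothetical tokenized documents of one cluster topic.
topic_docs = [
    ["Huabei", "cannot", "activate"],
    ["Huabei", "activate", "cannot"],
    ["cannot", "Huabei", "activate"],
    ["Huabei", "cannot"],
]
viewpoint = extract_viewpoint(topic_docs, top_k=3)
```

Here the mean positions are 1.25 for "Huabei", 2.0 for "cannot", and about 2.67 for "activate", so the words come out in the order of the viewpoint in the text.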
In the early warning part, the warning based on the number of documents of a user viewpoint can be carried out in the following three ways.
Step 10, document quantity information of the user viewpoint within the preset time is collected.
The document quantity information can be one or more quantity statistics such as the number of documents added for the user viewpoint within the preset time, the increase per unit time, the mean quantity within the preset time, and the growth rate. The preset time can be set according to the statistical requirement. For example, to monitor the number of documents newly added for a user viewpoint in one day, the document data of the most recent 30 days can be used to calculate the mean number of documents belonging to that user viewpoint per day.
Step 11, the mean number of documents of the user viewpoint within the preset time is calculated from the document quantity information.
Step 12, when the distance between the number of newly added documents of the user viewpoint and the mean number of documents exceeds the first preset threshold, an abnormal viewpoint warning is issued.
Specifically, the warning method of steps 10 to 12 can be realized by a method based on the rbf kernel (radial basis function kernel). The rbf kernel has the following form:
$$K(x, x') = \exp\left(-a \lVert x - x' \rVert^{2}\right)$$
Using the method based on the rbf kernel, and taking one month of data as an example, the historical data of the month give the daily mean number of documents belonging to the user viewpoint and the standard deviation of the number of documents of the user viewpoint over that month. The distance between the number of user documents newly flowing into the user viewpoint and the daily mean number of documents over the month is then calculated, and a warning is issued when this distance exceeds a preset threshold (for example, twice the standard deviation). In this way, whether the number of newly added documents is abnormal is judged against the mean number of user documents of the viewpoint over the preset period, so that a warning can be issued when this quantitative anomaly is found.
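The mean-based check of steps 10 to 12 can be sketched as below. The text does not spell out how the rbf kernel value enters the decision, so the sketch shows the kernel only as a helper and bases the alert on the distance to the mean, with a threshold of k standard deviations. The function names, the use of the population standard deviation, and the sample counts are assumptions.

```python
import math
from statistics import mean, pstdev


def rbf_kernel(x, x_prime, a=1.0):
    """K(x, x') = exp(-a * (x - x')^2) for scalar inputs; 1.0 means identical."""
    return math.exp(-a * (x - x_prime) ** 2)


def mean_based_alert(history, new_count, k=2.0):
    """True when new_count is more than k standard deviations away from the historical mean."""
    avg = mean(history)
    std = pstdev(history)               # population standard deviation (assumed)
    return abs(new_count - avg) > k * std


# Hypothetical daily document counts for one user viewpoint over 30 days.
history = [10] * 15 + [12] * 15
alert_on_surge = mean_based_alert(history, new_count=14)
alert_on_normal = mean_based_alert(history, new_count=12)
```

In this example the mean is 11 and the standard deviation is 1, so a new count of 14 triggers the alert and a new count of 12 does not.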
Optionally, a warning based on the number of user documents of the user viewpoint can also be issued through steps 13 to 15.
Step 13, document quantity information of the user viewpoint within the preset time is collected. See step 10.
Step 14, the number of newly added documents of the user viewpoint is predicted from the document quantity information to obtain a predicted quantity.
Step 15, when the difference between the number of newly added documents and the predicted quantity exceeds the second preset threshold, an abnormal viewpoint warning is issued.
Specifically, the prediction can be made by a method based on time series. Time series forecasting is a commonly used way of predicting future quantities, and a common time series forecasting method is the ARIMA method, which predicts the future from historical information. The predicted number of documents for the current day can be calculated from the historical document quantities (for example, the daily number of documents over the previous thirty days). If the number of documents in the cluster topic is much larger than the historical quantity, an alarm is raised. The application of the ARIMA method to quantity prediction based on time series is described in the related technical literature, for example "Time Series Forecasting Techniques, Part 3: ARIMA Model Forecasting with Independent Variables" (Shen Hao, 2009-12-02), and is not repeated here.
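The prediction-based check of steps 13 to 15 can be sketched with a deliberately simple stand-in. Instead of a full ARIMA fit, the sketch uses a first-order autoregressive model, that is ARIMA(1,0,0) with a constant, fitted by ordinary least squares in pure Python. The model order, the function names, and the use of the signed difference (only surges above the forecast raise an alert) are assumptions, not details given in the text.

```python
def ar1_forecast(history):
    """One-step forecast from y_t = c + phi * y_{t-1}, fitted by least squares."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    x, y = history[:-1], history[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        return float(history[-1])       # flat history: repeat the last value
    phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    c = mean_y - phi * mean_x
    return c + phi * history[-1]


def forecast_alert(history, new_count, second_threshold):
    """True when new_count exceeds the forecast by more than second_threshold."""
    return new_count - ar1_forecast(history) > second_threshold


# Hypothetical daily counts with a steady upward trend.
daily_counts = [2, 4, 6, 8, 10]
predicted = ar1_forecast(daily_counts)
```

A production system would more likely fit a full ARIMA model with an established time series library and choose the model order from the data.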
In another embodiment of the application, the two approaches of steps 10 to 15 can also be combined to issue a warning based on the number of user documents of the user viewpoint: an abnormal viewpoint warning is issued only when the distance between the number of newly added documents of the user viewpoint and the mean number of documents exceeds the first preset threshold and the difference between the number of newly added documents and the predicted quantity exceeds the second preset threshold. This can effectively reduce the probability of false alarms and improve the correctness of the warning.
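The combined condition can be sketched as a single predicate. The sketch assumes the historical mean and the predicted quantity have already been computed elsewhere, and the function and parameter names are hypothetical.

```python
def combined_alert(new_count, avg, predicted, first_threshold, second_threshold):
    """Alert only when both the mean-based and the prediction-based conditions hold."""
    exceeds_mean = abs(new_count - avg) > first_threshold
    exceeds_forecast = new_count - predicted > second_threshold
    return exceeds_mean and exceeds_forecast
```

For example, `combined_alert(50, 11, 12, 2, 10)` is true because both conditions hold, while `combined_alert(50, 11, 45, 2, 10)` is false because the forecast already anticipated most of the increase.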
In this embodiment, user documents are clustered and the user viewpoint expressed by each cluster topic is extracted. By analyzing the number of user documents of a given user viewpoint within a preset time, the growth rate of the number of documents of each user viewpoint can be monitored in real time and a warning issued when the data are abnormal. This helps to detect large-scale surges of user viewpoints in time, especially surges of negative viewpoints, so that an enterprise can respond quickly once a problem is found, effectively prevent the situation from deteriorating, and take the initiative in solving the problem. Extracting effective user intention features improves the effect of the clustering algorithm, and the streaming clustering method is better suited to real-time computation, making clustering faster and more accurate.
It should be noted that, in the description of the application, the terms "first", "second", and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, in the description of the application, unless otherwise stated, "a plurality of" means two or more.
Any process or method description in a flowchart or otherwise described herein can be understood as representing a module, fragment, or portion of code that includes one or more executable instructions for realizing a specific logical function or step of the process. The scope of the preferred embodiments of the application includes other implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the application belong.
It should be understood that each part of the application can be realized in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be realized in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if realized in hardware, as in another embodiment, any one of the following technologies known in the art, or a combination of them, can be used: a discrete logic circuit with logic gates for realizing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", and the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the application. In this specification, such schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described can be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the application have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the application. Those of ordinary skill in the art can change, modify, replace, and vary the above embodiments within the scope of the application.