Detailed description of the invention
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Big data, also called massive data, refers to data sets so large that mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that supports business decision-making. Big data is frequently used in the field of social sentiment statistics, for example in public opinion or popular sentiment statistics, in order to discover hot topics.
A hot topic is typically a topic followed by a large number of users, that is, a topic with high user attention. Its emergence is inseparable from the attention of users; user behavior therefore plays an important role in hot topic detection.
User network behavior mainly includes user access behavior and user search behavior. User access behavior can reflect a user's habits or personal interests and, viewed globally, the access behavior of many users can reflect user attention to a particular web page or class of web pages. User search behavior is the action of a user entering keywords into a search engine; it can accurately express the user's intent, and a search is often followed by access to pages in the search results. Within a search and its associated page accesses, the user's search keywords can serve to describe the topical features of the accessed pages; viewed globally, the search behavior of many users can therefore also reflect user attention to certain keywords.
Therefore, embodiments of the present invention use user network behavior data as an important basis for topic detection. The corresponding detection flow may specifically include: first, extracting hot web pages according to user network behavior data (a hot web page represents a web page with high user attention); then clustering these hot web pages to obtain corresponding potential hot topic classes (a potential hot topic class may or may not turn out to be hot); using the potential hot topic classes as seed classes to perform incremental clustering on newly added web pages; and finally, for each potential hot topic class after incremental clustering, determining whether it is a hot topic class by analyzing its corresponding user attention parameters. Because the newly added web pages may include both historical web pages and online web pages, the data used by the detection flow can include historical web page data such as the hot web pages as well as online web page data. Embodiments of the present invention can therefore combine the respective advantages of retrospective detection and online detection, offering both the effectiveness of retrospective detection and the timeliness of online detection. In addition, because the hot web pages used for detection are extracted according to user network behavior data, their data volume is small, which ensures detection efficiency. Consequently, embodiments of the present invention can ensure the accuracy, timeliness, and efficiency of detection at the same time, even in an Internet environment where large numbers of web page texts are updated rapidly.
Referring to Fig. 1, a flow chart of an embodiment of a big-data-based topic detection method according to the present invention is shown, which may specifically include:
Step 101: extracting hot web pages according to user network behavior data;
In this technical field, user network behavior data is the main data used to characterize user network behavior. It may come from log file sets on the network servers of operators or websites, and these log file sets can be regarded as the big data described herein. The log file sets contain HTTP (Hypertext Transfer Protocol) transaction records of the operator's customers or the website's users, and data characterizing user network behavior can be obtained from them by using techniques similar to network packet sniffing.
Specifically, the user behavior data in the log file sets mainly comprises user search behavior data and user access behavior data. The user search behavior data records users' search keywords and the corresponding search result pages; the user access behavior data records the pages users accessed. Search result pages and accessed pages are generally recorded in the form of URLs (Uniform Resource Locators). More specifically, the user access behavior data of some operators or websites also records the user's physical address and search redirect information, and the user search behavior data of some operators or websites also records the user's physical address and the hyperlink information of the web pages in the search result pages. Here, the user's physical address mainly includes the user's IP (Internet Protocol) address; the search redirect information indicates whether the currently accessed page was reached from a search result page and, if so, also records information about that search result page (such as its page address).
In a preferred embodiment of the present invention, the user network behavior data may specifically include one or more of user access behavior data and user search behavior data;
the step of extracting hot web pages according to user network behavior data may then specifically include:
Sub-step S111: according to the user access behavior data, obtaining the web pages whose user visit count or user access frequency meets a first preset condition, as hot web pages; and/or
Sub-step S112: according to the user search behavior data, obtaining the web pages associated with the keywords whose user search count or user search frequency meets a second preset condition, as hot web pages.
The first preset condition may be that the user visit count or user access frequency ranks in the top K1, and the second preset condition may be that the user search count or user search frequency ranks in the top K2. Those skilled in the art can preset K1 and K2 according to actual needs; embodiments of the present invention place no limitation on the specific values of K1 and K2.
In an application example of the embodiment of the present invention, the user access behavior data may be expressed as <(time1, url1), (time2, url2), ..., (timen, urln)>, where time1 ... timen and url1 ... urln denote the access times and the accessed URLs, respectively. In practice, by analyzing the access behavior data of a large number of users, the top K1 web page URLs can be obtained, which may be expressed as (ti, <(url1, visitors1), ..., (urlK1, visitorsK1)>).
User search behavior data may be expressed as <(time1, se1, keyword1), (time2, se2, keyword2), ..., (timen, sen, keywordn)>, where time1 ... timen denote the search times, se1 ... sen the search engines used, and keyword1 ... keywordn the search keywords. By analyzing the search behavior data of a large number of users, the top K2 search keywords can be obtained, which may be expressed as (ti, <(keywords1, num1), ..., (keywordsK2, numK2)>).
In the above parameters, ti denotes the specified time period, url1 and visitors1 denote a URL and its visit count, and keywords1 and num1 denote a search keyword and its search count.
On the basis of the top K2 search keywords, the web page URLs associated with each of these keywords can be further analyzed and obtained, which may be expressed as (keywords, <(url1, visitors1), ..., (urlk, visitorsk)>).
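As a non-limiting illustration, the ranking performed in sub-steps S111 and S112 may be sketched in Python as follows. The function names `top_k_urls` and `top_k_keywords`, the sample records, and the example URLs are hypothetical and are not part of the original disclosure; the record layouts follow the (time, url) and (time, se, keyword) tuples described above.

```python
from collections import Counter

def top_k_urls(access_log, k1):
    """Return the k1 most visited URLs as [(url, visit_count), ...].

    access_log is an iterable of (time, url) records.
    """
    counts = Counter(url for _time, url in access_log)
    return counts.most_common(k1)

def top_k_keywords(search_log, k2):
    """Return the k2 most searched keywords as [(keyword, search_count), ...].

    search_log is an iterable of (time, search_engine, keyword) records.
    """
    counts = Counter(keyword for _time, _se, keyword in search_log)
    return counts.most_common(k2)

# Hypothetical sample data for one time period ti.
access_log = [
    (1, "http://a.example/1"),
    (2, "http://b.example/2"),
    (3, "http://a.example/1"),
    (4, "http://c.example/3"),
    (5, "http://a.example/1"),
    (6, "http://b.example/2"),
]
search_log = [
    (1, "se1", "flood"),
    (2, "se2", "election"),
    (3, "se1", "flood"),
]

hot_urls = top_k_urls(access_log, 2)          # K1 = 2
hot_keywords = top_k_keywords(search_log, 1)  # K2 = 1
```

Counting by access frequency rather than visit count would only change the quantity being tallied; the ranking step stays the same.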
Step 102: collecting the content of the hot web pages;
In practice, techniques such as web crawlers can be used to collect the content of the hot web pages; embodiments of the present invention place no limitation on the specific collection method.
Step 103: extracting, according to the content of the hot web pages, the web page feature vectors of the hot web pages;
In an application example of the present invention, the step of extracting the web page feature vectors of the hot web pages according to their content may specifically include:
Sub-step S131: extracting the web page content features of the hot web pages from the collected content;
In sub-step S131, the collected web page content can be parsed to obtain feature information such as the web page title, web page body text, and web page description.
Sub-step S132: constructing the web page feature vectors of the hot web pages from the web page content features.
In sub-step S132, operations such as word segmentation and part-of-speech tagging, followed by processing such as stop-word filtering, can be applied to the preliminary web page content features; the resulting set of content terms serves as the basis for constructing the web page feature vector.
In a preferred embodiment of the present invention, the VSM (Vector Space Model) may be used as the text representation. The VSM represents a document as a vector, each dimension of which represents one feature word. The weight of a feature word may be defined by TF*IDF (term frequency-inverse document frequency):
$$w_i = tfs_i \times \log(N / n_i) \tag{1}$$
where $w_i$ denotes the weight of term $t_i$, $tfs_i$ denotes the importance of term $t_i$ in the current web page, $N$ denotes the number of web documents in the background corpus corresponding to the hot web pages, and $n_i$ denotes the number of web documents in the background corpus that contain $t_i$.
In a preferred embodiment of the present invention, the numbers of times term $t_i$ occurs in the web page title, the web page body, and the web page description can be considered separately and combined as a weighted sum according to importance to obtain $tfs_i$. The corresponding formula is as follows:
$$tfs_i = p_i \times \alpha + m_i \times \beta + c_i \times \gamma \tag{2}$$
where $p_i$, $m_i$, and $c_i$ denote the numbers of times term $t_i$ occurs in the web page title, the web page body, and the web page description, respectively, and $\alpha$, $\beta$, and $\gamma$ denote the respective weights.
To reduce the feature dimensionality, simplify computation, and prevent phenomena such as overfitting, in a preferred embodiment of the present invention the terms of a given hot web page can be sorted by weight, and the terms whose weight exceeds a specified threshold w are selected as feature words; all the feature words of a hot web page constitute its web page feature vector. The specified threshold w can be preset by those skilled in the art according to actual needs; embodiments of the present invention place no limitation on its specific value.
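The weighting of formulas (1) and (2) and the threshold-based selection of feature words may be sketched in Python as follows. This is only an illustration under stated assumptions: the default values of `alpha`, `beta`, and `gamma`, the use of the natural logarithm, and the names `term_weights` and `feature_vector` are hypothetical choices not fixed by the description.

```python
import math

def term_weights(title_counts, body_counts, desc_counts, doc_freq, n_docs,
                 alpha=3.0, beta=1.0, gamma=2.0):
    """Compute w_i = tfs_i * log(N / n_i) for every term of one web page.

    title_counts, body_counts, desc_counts: dicts term -> occurrences
        (p_i, m_i, c_i in formula (2)).
    doc_freq: dict term -> n_i, the number of background-corpus documents
        containing the term (assumed to be at least 1 for every term).
    n_docs: N, the size of the background corpus.
    """
    terms = set(title_counts) | set(body_counts) | set(desc_counts)
    weights = {}
    for t in terms:
        # Formula (2): weighted sum of occurrences by position.
        tfs = (title_counts.get(t, 0) * alpha
               + body_counts.get(t, 0) * beta
               + desc_counts.get(t, 0) * gamma)
        # Formula (1): natural logarithm assumed for log.
        weights[t] = tfs * math.log(n_docs / doc_freq[t])
    return weights

def feature_vector(weights, w_threshold):
    """Keep only the terms whose weight exceeds the specified threshold w."""
    return {t: w for t, w in weights.items() if w > w_threshold}

weights = term_weights(
    title_counts={"flood": 1},
    body_counts={"flood": 2, "the": 5},
    desc_counts={},
    doc_freq={"flood": 10, "the": 100},
    n_docs=100,
)
vector = feature_vector(weights, w_threshold=1.0)
```

In this example the term "the" occurs in every background document, so its inverse document frequency is zero and it is filtered out regardless of how often it appears in the page.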
It should be noted that the above VSM is merely a preferred way of constructing the web page feature vectors of the hot web pages and is not to be understood as a limitation on the application of the embodiments of the present invention.
Step 104: clustering the hot web pages according to their web page feature vectors to obtain corresponding potential hot topic classes;
High user attention is a key characteristic of a hot topic; embodiments of the present invention therefore obtain, through clustering, potential hot topic classes with high user attention. It should be noted that a potential hot topic class may or may not actually be hot, which needs to be determined further in the subsequent detection flow.
Clustering can be described as follows: the process of dividing a set of physical or abstract objects into multiple classes composed of similar objects is called clustering. A cluster generated by clustering is a set of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters.
Traditional clustering methods may specifically include partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, transitive closure methods, Boolean matrix methods, direct clustering, correlation analysis clustering, statistics-based clustering methods, and so on.
In a preferred embodiment of the present invention, the K-means clustering method, one of the partitioning methods, may be used. The basic idea of K-means clustering is to accept an input number K and then divide n data objects into K clusters such that objects within the same cluster have high similarity while objects in different clusters have low similarity.
In an application example of the present invention, K-means clustering may be implemented as follows: first, K hot web pages are selected from all hot web pages as the centers of K initial clusters; each remaining hot web page is then assigned to the initial cluster it is most similar to, according to its similarity to the cluster centers; next, the center of each new cluster is recalculated (as the mean of all hot web pages in that cluster); this process is repeated until a criterion function (such as the mean squared error) converges.
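The K-means procedure just described may be sketched in Python as follows. The sketch makes several assumptions that the description leaves open: feature vectors are dense lists of equal length, the first K pages are taken as the initial centers, similarity is measured by the cosine of the angle between vectors, and iteration stops when the assignments no longer change (used here as a simple stand-in for convergence of a criterion function). The names `cosine`, `mean_vector`, and `k_means` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def k_means(vectors, k, max_iter=100):
    """Cluster vectors into k groups; return (assignment, centers)."""
    # Assumption: the first k pages serve as the initial cluster centers.
    centers = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every page to the center it is most similar to.
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # assignments are stable, treat as converged
        assignment = new_assignment
        # Recompute each center as the mean of its member pages.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = mean_vector(members)
    return assignment, centers

pages = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
assignment, centers = k_means(pages, k=2)
```

A production implementation would normally choose the initial centers more carefully, since K-means results depend on the initialization.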
In a specific implementation, the value of K can be set by those skilled in the art according to actual needs. The VSM can be used to calculate the similarity sim(D1, D2) between a hot web page D1 and the center D2 of a cluster; when this similarity exceeds a certain similarity threshold, the hot web page can be assigned to that cluster. Embodiments of the present invention place no limitation on the specific similarity threshold.
In an application example of the present invention, sim(D1, D2) may be expressed as:
$$sim(D_1, D_2) = \frac{W(D_1) \cdot W(D_2)}{|W(D_1)| \times |W(D_2)|} \tag{3}$$
where $W(D_1)$ and $W(D_2)$ denote the feature vectors of D1 and D2, $|W(D_1)|$ and $|W(D_2)|$ denote the moduli (lengths) of these feature vectors, and $W(D_1) \cdot W(D_2)$ denotes their dot product.
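The similarity sim(D1, D2) described above, namely the dot product of the two feature vectors divided by the product of their lengths, may be sketched in Python as follows. Representing a feature vector as a dictionary from feature word to weight, and the function name `cosine_similarity`, are assumptions made for illustration.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of two sparse vectors given as dicts term -> weight."""
    # Dot product over the terms of v1; terms absent from v2 contribute 0.
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)

same_direction = cosine_similarity({"flood": 1.0}, {"flood": 2.0})
no_overlap = cosine_similarity({"flood": 1.0}, {"election": 1.0})
partial = cosine_similarity({"flood": 1.0, "rain": 1.0}, {"flood": 1.0})
```

Because the measure depends only on direction, two pages that use the same feature words in the same proportions score 1 regardless of page length.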
In practice, the number of potential topic classes obtained may be large. To ensure the validity of the potential topic classes, embodiments of the present invention may screen all potential hot topic classes obtained by clustering. The screening method may include: sorting all potential hot topic classes obtained by clustering in descending order of the number of hot web pages they contain, and selecting the top several as the final potential hot topic classes; or taking the potential hot topic classes that contain more hot web pages than a class threshold as the final potential hot topic classes; and so on. It is understood that embodiments of the present invention place no limitation on the specific screening method or class threshold.
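The two screening alternatives described above may be sketched in Python as follows, assuming each potential topic class is represented simply as a list of its member pages. The function names `screen_by_rank` and `screen_by_threshold` and the sample data are hypothetical.

```python
def screen_by_rank(clusters, top_n):
    """Keep the top_n classes containing the most pages, largest first."""
    return sorted(clusters, key=len, reverse=True)[:top_n]

def screen_by_threshold(clusters, class_threshold):
    """Keep the classes that contain more pages than the class threshold."""
    return [c for c in clusters if len(c) > class_threshold]

clusters = [["a"], ["b", "c", "d"], ["e", "f"]]
by_rank = screen_by_rank(clusters, top_n=2)
by_threshold = screen_by_threshold(clusters, class_threshold=1)
```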
Step 105: using the potential hot topic classes as seed classes, performing incremental clustering on newly added web pages, where the newly added web pages may specifically include online web pages;
To process newly added web pages, clustering could be re-run on the entire enlarged data set. Although this re-clustering approach is simple, re-running the clustering not only wastes computation but also tends to greatly reduce the efficiency of most memory-based clustering algorithms; re-clustering is therefore generally not used.
Embodiments of the present invention instead use an incremental clustering method, which processes only the incremental part of the data in the database and incrementally modifies and refines the existing clustering result. The new data may be added one item at a time or in batches.
In a preferred embodiment of the present invention, the step of using the potential hot topic classes as seed classes to perform incremental clustering on newly added web pages may specifically include:
Sub-step S151: calculating the similarity between the web page feature vector of each newly added web page and the centroid vector of each potential hot topic class;
Sub-step S152: when the similarity between the web page feature vector of a newly added web page and the centroid vector of a potential hot topic class is greater than or equal to a first similarity threshold, adding the newly added web page to that potential hot topic class.
In a preferred embodiment of the present invention, the centroid vector of a potential hot topic class may be obtained by weighting the web page feature vectors of the hot web pages included in that class, where the weight of a hot web page's feature vector is determined by the ratio of that page's user visit count to the total user visit count of all hot web pages in the potential hot topic class to which it belongs.
In a specific implementation, the similarity in sub-step S151 can be calculated using formula (3). The first similarity threshold can be set by those skilled in the art according to actual needs; embodiments of the present invention place no limitation on its specific value. The weighting may include a weighted average, a moving weighted average, and so on; embodiments of the present invention place no limitation on the specific weighting.
It should be noted that the incremental clustering method of sub-steps S151 and S152 is merely a preferred embodiment and is not to be understood as a limitation on the application of the embodiments of the present invention.
In addition, the newly added web pages include online web pages so that the embodiments of the present invention have the advantages of online detection; it is understood that the newly added web pages may also include historical web pages.
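Sub-steps S151 and S152, together with the visit-weighted centroid, may be sketched in Python as follows. The sketch rests on assumptions not fixed by the description: a class is a list of (feature vector, visit count) pairs with positive visit counts, feature vectors are dictionaries from feature word to weight, the weighting is a plain weighted average, and a page that matches several classes is added only to the most similar one. The names `cosine`, `centroid`, and `assign_new_page` are hypothetical.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two sparse vectors given as dicts term -> weight."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def centroid(pages):
    """Visit-weighted average of the feature vectors in one class.

    pages: list of (vector, visits); each page's weight is its share of the
    total visit count of the class (total assumed to be positive).
    """
    total = sum(visits for _vec, visits in pages)
    center = {}
    for vec, visits in pages:
        share = visits / total
        for term, w in vec.items():
            center[term] = center.get(term, 0.0) + share * w
    return center

def assign_new_page(new_vec, new_visits, seed_classes, sim_threshold):
    """Add the page to the best-matching seed class, if any.

    Returns the index of the class the page joined, or None when no class
    reaches the first similarity threshold.
    """
    best_idx, best_sim = None, sim_threshold
    for idx, pages in enumerate(seed_classes):
        sim = cosine(new_vec, centroid(pages))
        if sim >= best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is not None:
        seed_classes[best_idx].append((new_vec, new_visits))
    return best_idx

seed_classes = [
    [({"flood": 1.0, "rain": 0.5}, 10)],
    [({"election": 1.0}, 5)],
]
joined = assign_new_page({"flood": 0.8, "rain": 0.6}, 3, seed_classes, 0.5)
rejected = assign_new_page({"sports": 1.0}, 1, seed_classes, 0.5)
```

Because only the new page is compared against the existing centroids, the cost of handling one page grows with the number of seed classes rather than with the size of the whole data set.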
Step 106: for each potential hot topic class after incremental clustering, determining whether it is a hot topic class by analyzing its corresponding user attention parameters.
A potential hot topic class may or may not actually be hot; step 106 determines whether each potential hot topic class after incremental clustering is a hot topic class.
In embodiments of the present invention, the user attention parameters may specifically include the number of web documents and the amount of user network behavior, where the amount of user network behavior may specifically include one or more of the user visit count and the user search count.
In a preferred embodiment of the present invention, the step of determining, for each potential hot topic class after incremental clustering, whether it is a hot topic class by analyzing its corresponding user attention parameters may specifically include:
Sub-step S161: when the ratio of the weighted result of the user attention parameters of a potential hot topic class after incremental clustering to the weighted result of the user attention parameters of all potential hot topic classes after incremental clustering is greater than a first threshold, determining that the potential hot topic class is a hot topic class.
The first threshold can be set by those skilled in the art according to actual needs; embodiments of the present invention place no limitation on its specific value. The weighting may include a weighted average, a moving weighted average, and so on; embodiments of the present invention place no limitation on the specific weighting. When there are multiple user attention parameters, the weight of each parameter can be set by those skilled in the art according to actual needs; embodiments of the present invention place no limitation on the specific weights.
Of course, the determination method of sub-step S161 is merely a preferred embodiment; other determination methods are also feasible. For example, all potential hot topic classes after incremental clustering can be sorted in descending order of their user attention parameters, and the top several selected as hot topic classes.
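The ratio test of sub-step S161 may be sketched in Python as follows. The sketch assumes two user attention parameters per class (number of web documents and amount of user network behavior) combined by a simple weighted sum; the equal default weights, the sample figures, and the names `attention_score` and `hot_topic_classes` are hypothetical.

```python
def attention_score(doc_count, behavior_count, w_docs=0.5, w_behavior=0.5):
    """Weighted result of the user attention parameters of one class."""
    return w_docs * doc_count + w_behavior * behavior_count

def hot_topic_classes(classes, first_threshold):
    """Return the names of classes whose share of total attention exceeds
    the first threshold.

    classes: dict name -> (doc_count, behavior_count).
    """
    scores = {name: attention_score(*params)
              for name, params in classes.items()}
    total = sum(scores.values())
    if total <= 0:
        return []  # no attention recorded at all
    return [name for name, s in scores.items() if s / total > first_threshold]

classes = {"A": (10, 900), "B": (5, 95), "C": (1, 9)}
hot = hot_topic_classes(classes, first_threshold=0.3)
```

In this hypothetical data set class A accounts for 455 of the 510 total weighted points, so only A passes a threshold of 0.3.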
In a preferred embodiment of the present invention, the method may further include:
Step S201: for each potential hot topic class after incremental clustering, predicting whether it will be a hot topic class in the next period by analyzing the change in its corresponding user attention parameters over past periods.
In practice, time can be divided into periods, for example in units of days, half-days, hours, or minutes. The change in a user attention parameter over past periods may be the change of the parameter in the current period relative to its value in the previous period, which may be expressed by the following formula:
change in user attention parameter over past periods = (user attention parameter in the current period - user attention parameter in the previous period) / user attention parameter in the previous period (4)
In an application example of the present invention, if the change over past periods in the user attention parameters of a potential hot topic class after incremental clustering is greater than a third threshold, that potential hot topic class can be predicted to be a hot topic class in the next period. The third threshold can be set by those skilled in the art according to actual needs; embodiments of the present invention place no limitation on its specific value.
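Formula (4) and the threshold comparison may be sketched in Python as follows. Treating a previous-period value of zero as "no prediction" is an assumption added here to avoid division by zero; the names `attention_change` and `predict_hot` and the sample figures are hypothetical.

```python
def attention_change(current, previous):
    """Relative change of a user attention parameter, as in formula (4)."""
    return (current - previous) / previous

def predict_hot(current, previous, third_threshold):
    """Predict whether the class will be hot in the next period."""
    if previous == 0:
        return False  # assumption: no baseline, so no prediction is made
    return attention_change(current, previous) > third_threshold

change = attention_change(150, 100)
rising = predict_hot(150, 100, third_threshold=0.3)
flat = predict_hot(110, 100, third_threshold=0.3)
```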
It should be noted that the prediction scheme of step S201 is merely a preferred scheme; in fact, any scheme that predicts the hot topic classes of the next period according to the trend of the user attention parameters is feasible.
In a preferred embodiment of the present invention, the method may further include:
presenting the determined or predicted hot topic classes, where the presented content may specifically include the descriptive keywords of the corresponding hot topic class.
In a preferred embodiment of the present invention, the descriptive keywords may specifically include the several feature words with the highest co-occurrence degree across all web pages of the corresponding hot topic class. The co-occurrence degree of a feature word may be represented by the number of web pages in which that feature word appears; the number of feature words selected can be set by those skilled in the art according to actual needs.
Further, if many feature words have a high co-occurrence degree across all web pages of a hot topic class, these feature words can be further screened in descending order of feature word weight, where the weight of a feature word can be obtained using formula (1).
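The selection of descriptive keywords may be sketched in Python as follows. The sketch ranks feature words first by co-occurrence degree (the number of pages containing the word) and then by weight. Summing each word's per-page weights to obtain a single class-level weight is an assumption made for illustration, as are the name `description_keywords` and the sample data.

```python
from collections import Counter

def description_keywords(page_vectors, n):
    """Pick the n feature words that best describe a hot topic class.

    page_vectors: list of dicts term -> weight, one per web page in the class.
    """
    cooccurrence = Counter()   # term -> number of pages containing it
    total_weight = Counter()   # term -> summed weight over pages (assumption)
    for vec in page_vectors:
        for term, w in vec.items():
            cooccurrence[term] += 1
            total_weight[term] += w
    ranked = sorted(
        cooccurrence,
        key=lambda t: (cooccurrence[t], total_weight[t]),
        reverse=True,
    )
    return ranked[:n]

pages = [
    {"flood": 2.0, "rain": 1.0},
    {"flood": 1.5, "rescue": 3.0},
    {"flood": 1.0, "rain": 0.5, "rescue": 0.2},
]
keywords = description_keywords(pages, 2)
```

Here "flood" appears in all three pages, while "rain" and "rescue" each appear in two; the tie is broken in favor of "rescue" because of its higher weight.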
Corresponding to the foregoing method embodiment, an embodiment of the present invention also discloses a big-data-based topic detection device which, referring to the structure diagram shown in Fig. 2, may specifically include:
an extraction module 201, configured to extract hot web pages according to user network behavior data;
a collection module 202, configured to collect the content of the hot web pages;
a feature extraction module 203, configured to extract, according to the content of the hot web pages, the web page feature vectors of the hot web pages;
a clustering module 204, configured to cluster the hot web pages according to their web page feature vectors to obtain corresponding potential hot topic classes;
an incremental clustering module 205, configured to use the potential hot topic classes as seed classes to perform incremental clustering on newly added web pages, where the newly added web pages include online web pages; and
a determination module 206, configured to determine, for each potential hot topic class after incremental clustering, whether it is a hot topic class by analyzing its corresponding user attention parameters.
In a preferred embodiment of the present invention, the user network behavior data may specifically include one or more of user access behavior data and user search behavior data;
the extraction module 201 may then specifically include:
a first extraction submodule, configured to obtain, according to the user access behavior data, the web pages whose user visit count or user access frequency meets a first preset condition, as hot web pages; and/or
a second extraction submodule, configured to obtain, according to the user search behavior data, the web pages associated with the keywords whose user search count or user search frequency meets a second preset condition, as hot web pages.
In another preferred embodiment of the present invention, the determination module includes:
a weighted determination submodule, configured to determine that a potential hot topic class after incremental clustering is a hot topic class when the ratio of the weighted result of its user attention parameters to the weighted result of the user attention parameters of all potential hot topic classes after incremental clustering is greater than a first threshold.
In yet another preferred embodiment of the present invention, the incremental clustering module 205 may specifically include:
a similarity calculation submodule, configured to calculate the similarity between the web page feature vector of each newly added web page and the centroid vector of each potential hot topic class;
a comparison submodule, configured to add a newly added web page to a potential hot topic class when the similarity between the web page feature vector of the newly added web page and the centroid vector of that potential hot topic class is greater than or equal to a first similarity threshold.
In a preferred embodiment of the present invention, the centroid vector of a potential hot topic class may be obtained by weighting the web page feature vectors of the hot web pages included in that class, where the weight of a hot web page's feature vector may be determined by the ratio of that page's user visit count to the total user visit count of all hot web pages in the potential hot topic class to which it belongs.
In another preferred embodiment of the present invention, the device may further include:
a prediction module, configured to predict, for each potential hot topic class after incremental clustering, whether it will be a hot topic class in the next period by analyzing the change in its corresponding user attention parameters over past periods.
In embodiments of the present invention, the device may preferably further include:
a presentation module, configured to present the determined or predicted hot topic classes, where the presented content includes the descriptive keywords of the corresponding hot topic class.
In a preferred embodiment of the present invention, the descriptive keywords may specifically include the several feature words with the highest co-occurrence degree across all web pages of the corresponding hot topic class.
In a preferred embodiment of the present invention, the user attention parameters may specifically include the number of web documents and the amount of user network behavior.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flow charts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
The big-data-based topic detection method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.