CN103177090B - A kind of topic detection method and device based on big data - Google Patents

Info

Publication number
CN103177090B
Authority
CN
China
Prior art keywords
webpage
talked
user
much
topic class
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310075129.4A
Other languages
Chinese (zh)
Other versions
CN103177090A (en)
Inventor
罗峰
黄苏支
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IZP (BEIJING) TECHNOLOGIES Co.,Ltd.
Original Assignee
IZP (BEIJING) TECHNOLOGIES Co Ltd
Application filed by IZP (BEIJING) TECHNOLOGIES Co Ltd
Priority to CN201310075129.4A
Publication of CN103177090A
Application granted
Publication of CN103177090B

Abstract

The invention provides a topic detection method and device based on big data that can ensure both the accuracy and the timeliness of detection when large numbers of web page texts in the Internet environment are updated rapidly. The method includes: extracting focus webpages according to user network behavior data; collecting the content of the focus webpages; extracting the web page feature vectors of the focus webpages from that content; clustering the focus webpages according to their web page feature vectors to obtain corresponding potential much-talked-about topic classes; using the potential much-talked-about topic classes as seed classes, performing incremental clustering on newly added webpages, where the newly added webpages include online webpages; and, for each potential much-talked-about topic class after incremental clustering, determining whether it is a much-talked-about topic class by analyzing its corresponding user attention parameters.

Description

A kind of topic detection method and device based on big data
Technical field
The present invention relates to the field of Internet information processing technology, and in particular to a topic detection method and device based on big data.
Background technology
With the rapid development of the Internet, information on the network is becoming more diverse and abundant. At the same time, the social influence of online public opinion keeps growing, and many hot social events are first disclosed and spread on the network, so network topic detection is increasingly valuable. The Internet contains a large amount of web page text in natural-language form, including news, blogs, forum posts, and the more recent microblogs; these web page texts are the most basic data source for discovering much-talked-about topics.
The TDT (Topic Detection and Tracking) project carried out by the U.S. Department of Defense was the earliest to undertake research on topic detection, and it has made some progress.
According to when detection is carried out, current topic detection methods fall into two kinds: retrospective detection and online detection. Retrospective detection first obtains all webpages and then clusters the obtained web page texts with a traditional text clustering algorithm to discover the topics they contain. Online detection identifies, in online form, the starting position of a new topic in the web page text stream obtained in real time, and adds the new topic to the existing topics.
Each of the two methods has strengths and weaknesses. The advantage of retrospective detection is that better-performing text mining algorithms can be chosen to process the collected web data offline, so the results can be better optimized; but because it processes web data offline, its biggest shortcoming is poor timeliness. Online detection is receiving more and more attention because it can meet the demand for real-time detection of much-talked-about topics; but because processing time is limited, the algorithms it uses are relatively simple, so its detection quality still lags behind that of retrospective detection.
In short, a technical problem that those skilled in the art urgently need to solve is how to resolve the sharp conflict between the accuracy and the timeliness of topic detection when large numbers of web page texts in the Internet environment are updated rapidly.
Summary of the invention
The technical problem to be solved by the present invention is to provide a topic detection method and device based on big data that can ensure both the accuracy and the timeliness of detection when large numbers of web page texts in the Internet environment are updated rapidly.
To solve the above problem, the invention discloses a topic detection method based on big data, including:
extracting focus webpages according to user network behavior data;
collecting the content of the focus webpages;
extracting the web page feature vectors of the focus webpages from the content of the focus webpages;
clustering the focus webpages according to their web page feature vectors to obtain corresponding potential much-talked-about topic classes;
using the potential much-talked-about topic classes as seed classes, performing incremental clustering on newly added webpages, where the newly added webpages include online webpages;
for each potential much-talked-about topic class after incremental clustering, determining whether it is a much-talked-about topic class by analyzing its corresponding user attention parameters.
Optionally, the user network behavior data includes one or more of user access behavior data and user search behavior data. The step of extracting focus webpages according to user network behavior data then includes: according to the user access behavior data, obtaining webpages whose user visit count or user access frequency meets a first preset condition, as focus webpages; and/or, according to the user search behavior data, obtaining the webpages associated with keywords whose user search volume or user search frequency meets a second preset condition, as focus webpages.
Optionally, the step of determining, for each potential much-talked-about topic class after incremental clustering, whether it is a much-talked-about topic class by analyzing its corresponding user attention parameters includes: when the ratio of the weighted result of the user attention parameters of a given potential much-talked-about topic class after incremental clustering to the weighted result of the user attention parameters of all potential much-talked-about topic classes after incremental clustering is greater than a first threshold, determining that this potential much-talked-about topic class is a much-talked-about topic class.
Optionally, the step of using the potential much-talked-about topic classes as seed classes to perform incremental clustering on newly added webpages includes: calculating the similarity between the web page feature vector of each newly added webpage and the centroid vector of each potential much-talked-about topic class; and, when the similarity between the web page feature vector of a newly added webpage and the centroid vector of a potential much-talked-about topic class is greater than or equal to a first similarity threshold, adding the newly added webpage to that potential much-talked-about topic class.
Optionally, the centroid vector of a potential much-talked-about topic class is obtained by weighting the web page feature vectors of the focus webpages it contains, where the weight of a focus webpage's feature vector is determined by the ratio of that focus webpage's user visit count to the total user visit count of all focus webpages in the potential much-talked-about topic class to which it belongs.
Optionally, the method further includes: for each potential much-talked-about topic class after incremental clustering, predicting whether it will be a much-talked-about topic class in the next period by analyzing how its corresponding user attention parameters changed over past periods.
Optionally, the method further includes: presenting the much-talked-about topic classes that have been determined or predicted, where the presented content includes the description keywords of the corresponding much-talked-about topic class.
Optionally, the description keywords include the several feature words with the highest co-occurrence across all webpages of the corresponding much-talked-about topic class.
Optionally, the user attention parameters include the number of web documents and the amount of user network behavior.
Correspondingly, the invention also discloses a topic detection device based on big data, including:
an abstraction module, for extracting focus webpages according to user network behavior data;
an acquisition module, for collecting the content of the focus webpages;
an extraction module, for extracting the web page feature vectors of the focus webpages from the content of the focus webpages;
a cluster module, for clustering the focus webpages according to their web page feature vectors to obtain corresponding potential much-talked-about topic classes;
an incremental cluster module, for using the potential much-talked-about topic classes as seed classes to perform incremental clustering on newly added webpages, where the newly added webpages include online webpages; and
a determination module, for determining, for each potential much-talked-about topic class after incremental clustering, whether it is a much-talked-about topic class by analyzing its corresponding user attention parameters.
Compared with the prior art, the embodiment of the present invention has the following advantages:
The data used by the detection flow of the embodiment includes both historical web data, such as the focus webpages, and online web data. The embodiment therefore combines the advantages of retrospective detection and online detection: it has the detection quality of the former and the timeliness of the latter. In addition, because the focus webpages used for detection are extracted according to user network behavior data, their volume is small, which ensures detection efficiency. The embodiment can therefore ensure the accuracy, timeliness, and efficiency of detection at the same time when large numbers of web page texts in the Internet environment are updated rapidly.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of a topic detection method based on big data according to the present invention;
Fig. 2 is a structural diagram of an embodiment of a topic detection device based on big data according to the present invention.
Detailed description of the invention
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Big data, also called massive data, refers to data whose volume is so large that current mainstream software tools cannot capture, manage, process, and organize it within a reasonable time into information that supports better business decisions. It is often used in social sentiment statistics, such as public opinion statistics, to discover much-talked-about topics.
A much-talked-about topic is one that many users pay attention to, that is, a topic with high user attention. Its emergence is inseparable from the attention of users, so user behavior plays an important role in detecting much-talked-about topics.
User network behavior mainly includes user access behavior and user search behavior. User access behavior reflects a user's habits or personal interests, and, viewed globally, the access behavior of many users reflects how much attention users pay to a particular webpage or class of webpages. User search behavior is the act of entering keywords into a search engine; it expresses the user's intent accurately, and after searching the user usually goes on to access pages in the search results. Within one search and its associated page accesses, the user's search keywords can be used to describe the topical features of the accessed pages, so, viewed globally, the search behavior of many users also reflects users' attention to certain keywords.
Therefore, the embodiment of the present invention takes user network behavior data as an important basis for topic detection. The detection flow may include: first extracting focus webpages according to user network behavior data (a focus webpage is a webpage with high user attention); then clustering these focus webpages to obtain corresponding potential much-talked-about topic classes (classes that may turn out to be hot); using the potential much-talked-about topic classes as seed classes to perform incremental clustering on newly added webpages; and finally, for each potential much-talked-about topic class after incremental clustering, determining whether it is a much-talked-about topic class by analyzing its corresponding user attention parameters. Because the newly added webpages may include both historical webpages and online webpages, the data used by the detection flow includes both historical web data, such as the focus webpages, and online web data. The embodiment therefore combines the advantages of retrospective detection and online detection: it has the detection quality of the former and the timeliness of the latter. In addition, because the focus webpages used for detection are extracted according to user network behavior data, their volume is small, which ensures detection efficiency. The embodiment can therefore ensure the accuracy, timeliness, and efficiency of detection at the same time when large numbers of web page texts in the Internet environment are updated rapidly.
Referring to Fig. 1, a flowchart of an embodiment of a topic detection method based on big data according to the present invention is shown. The method may include:
Step 101: extract focus webpages according to user network behavior data;
In this field, user network behavior data is the main data used to characterize user network behavior. It can come from log file sets on an operator's or website's web servers; these log file sets can be regarded as the big data referred to here. The log file sets contain records of the HTTP (Hypertext Transfer Protocol) transactions performed by the operator's or website's users, and data characterizing user network behavior can be obtained from them by using techniques similar to network packet sniffing.
Specifically, the user behavior data in the log file sets mainly comprises user search behavior data and user access behavior data. The user search behavior data records the user's search keywords and the corresponding search result pages; the user access behavior data records the pages the user accessed. Search result pages and accessed pages are usually recorded as URLs (Uniform Resource Locators). In addition, the user access behavior data of some operators or websites also records the user's physical address and search jump information, and the user search behavior data of some operators or websites also records the user's physical address and the hyperlink information of the webpages in the search result pages. Here, the user's physical address mainly includes the user's IP (Internet Protocol) address; the search jump information indicates whether the currently accessed page came from a search result page and, if so, also records information about that search result page (such as its page address).
In a preferred embodiment of the present invention, the user network behavior data may include one or more of user access behavior data and user search behavior data;
The step of extracting focus webpages according to user network behavior data may then include:
Sub-step S111: according to the user access behavior data, obtain webpages whose user visit count or user access frequency meets a first preset condition, as focus webpages; and/or
Sub-step S112: according to the user search behavior data, obtain the webpages associated with keywords whose user search volume or user search frequency meets a second preset condition, as focus webpages.
The first preset condition may be that the user visit count or user access frequency ranks in the top K1, and the second preset condition may be that the user search volume or user search frequency ranks in the top K2. Those skilled in the art can preset K1 and K2 according to actual needs; the embodiment of the present invention places no limit on their specific values.
In one application example of the embodiment of the present invention, user access behavior data can be expressed as <(time1, url1), (time2, url2), ..., (timen, urln)>, where time1 ... timen and url1 ... urln denote the access times and the accessed URLs, respectively. In practice, by analyzing a large amount of user access behavior data, the top-K1 webpage URLs can be obtained, expressed as (ti, <(url1, visitors1), ..., (urlk1, visitorsk1)>).
User search behavior data can be expressed as <(time1, se1, keyword1), (time2, se2, keyword2), ..., (timen, sen, keywordn)>, where time1 ... timen denote the search times, se1 ... sen denote the search engines used, and keyword1 ... keywordn denote the search keywords. By analyzing a large amount of user search behavior data, the top-K2 search keywords can be obtained, expressed as (ti, <(keywords1, num1), ..., (keywordsk2, numk2)>).
In the above notation, ti denotes the specified time period, url1 and visitors1 denote a URL and its visit count, and keywords1 and num1 denote a search keyword and its search volume.
Once the top-K2 search keywords have been obtained, the webpage URLs associated with each of them can be analyzed further and expressed as (keywords, <(url1, visitors1), ..., (urlk, visitorsk)>).
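The top-K1 and top-K2 extraction of step 101 can be sketched roughly as follows. This is only an illustrative sketch, not the patent's implementation: the function names `top_k_urls` and `top_k_keywords`, the sample logs, and the use of raw counts (rather than frequencies within a time period ti) are all assumptions.

```python
from collections import Counter

def top_k_urls(access_log, k1):
    """Return the k1 most-visited URLs as (url, visits) pairs.

    access_log: iterable of (time, url) records, mirroring the
    <(time1, url1), ..., (timen, urln)> form used in the text.
    """
    visits = Counter(url for _time, url in access_log)
    return visits.most_common(k1)

def top_k_keywords(search_log, k2):
    """Return the k2 most-searched keywords as (keyword, count) pairs.

    search_log: iterable of (time, search_engine, keyword) records.
    """
    counts = Counter(keyword for _time, _engine, keyword in search_log)
    return counts.most_common(k2)

# Hypothetical sample data for one time period.
access_log = [
    ("t1", "http://a.example/1"),
    ("t2", "http://a.example/2"),
    ("t3", "http://a.example/1"),
    ("t4", "http://a.example/3"),
    ("t5", "http://a.example/1"),
    ("t6", "http://a.example/2"),
]
search_log = [
    ("t1", "se1", "flood"),
    ("t2", "se2", "flood"),
    ("t3", "se1", "election"),
]

hot_urls = top_k_urls(access_log, 2)
hot_keywords = top_k_keywords(search_log, 1)
```

Mapping each top keyword to its associated URLs would additionally require the search jump information described earlier, which this sketch leaves out.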
Step 102: collect the content of the focus webpages;
In practice, techniques such as web crawlers can be used to collect the content of the focus webpages; the embodiment of the present invention places no limit on the specific collection method.
Step 103: extract the web page feature vectors of the focus webpages from the content of the focus webpages;
In one application example of the present invention, the step of extracting the web page feature vectors of the focus webpages from their content may include:
Sub-step S131: extract the web page content features of the focus webpages from the collected content;
Sub-step S131 can parse the collected web page content to obtain feature information such as the web page title, the web page body text, and the web page description.
Sub-step S132: construct the web page feature vectors of the focus webpages from the web page content features.
Sub-step S132 may perform word segmentation, part-of-speech tagging, and similar work on the preliminary web page content features, followed by processing such as stop-word filtering; the resulting content vocabulary set serves as the basis for constructing the web page feature vector.
In a preferred embodiment of the present invention, the VSM (Vector Space Model) can be used to represent text. The VSM represents a document as a vector in which each dimension corresponds to one feature word. The weight of a feature word can be defined by TF*IDF (term frequency-inverse document frequency):
$$w_i = tfs_i \times \log(N / n_i) \qquad (1)$$
where $w_i$ is the weight of term $t_i$, $tfs_i$ is the importance of term $t_i$ in the current webpage, $N$ is the number of web documents in the background corpus corresponding to the focus webpages, and $n_i$ is the number of web documents in the background corpus that contain $t_i$.
In a preferred embodiment of the present invention, $tfs_i$ can be obtained by counting the occurrences of term $t_i$ in the web page title, the web page body text, and the web page description separately and taking a weighted sum by importance:
$$tfs_i = p_i \times \alpha + m_i \times \beta + c_i \times \gamma \qquad (2)$$
where $p_i$, $m_i$, and $c_i$ are the numbers of times term $t_i$ occurs in the web page title, the web page body text, and the web page description, respectively, and $\alpha$, $\beta$, and $\gamma$ are the corresponding weights.
To reduce the feature dimensionality, simplify computation, and prevent problems such as over-fitting, in a preferred embodiment of the present invention the terms ti of a focus webpage can be sorted by weight, and the terms whose weight exceeds a specified threshold w are selected as feature words; all feature words of a focus webpage make up its web page feature vector. The threshold w can be preset by those skilled in the art according to actual needs; the embodiment of the present invention places no limit on its specific value.
It should be noted that the VSM described above is only a preferred way of constructing the web page feature vectors of the focus webpages and does not limit the application of the embodiment of the present invention.
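Formulas (1) and (2) and the threshold-based selection of feature words can be illustrated with the short sketch below. The function names, the default values of alpha, beta, and gamma, the sample counts, and the choice to skip terms absent from the background corpus are all assumptions made for illustration; the patent does not fix any of them.

```python
import math

def term_weights(title_counts, body_counts, desc_counts, doc_freq, n_docs,
                 alpha=3.0, beta=1.0, gamma=2.0):
    """Compute w_i = tfs_i * log(N / n_i), with
    tfs_i = p_i * alpha + m_i * beta + c_i * gamma (formulas (1) and (2))."""
    weights = {}
    terms = set(title_counts) | set(body_counts) | set(desc_counts)
    for term in terms:
        tfs = (title_counts.get(term, 0) * alpha
               + body_counts.get(term, 0) * beta
               + desc_counts.get(term, 0) * gamma)
        n_i = doc_freq.get(term, 0)
        if n_i == 0:
            continue  # assumption: ignore terms unseen in the background corpus
        weights[term] = tfs * math.log(n_docs / n_i)
    return weights

def feature_vector(weights, w_threshold):
    """Keep only terms whose weight exceeds the specified threshold w."""
    return {term: w for term, w in weights.items() if w > w_threshold}

# Hypothetical page: "flood" appears once in the title and 3 times in the body.
weights = term_weights(
    title_counts={"flood": 1},
    body_counts={"flood": 3, "the": 10},
    desc_counts={},
    doc_freq={"flood": 10, "the": 1000},
    n_docs=1000,
)
vec = feature_vector(weights, w_threshold=1.0)
```

In this example the term "the" occurs in every background document, so its IDF factor is log(1) = 0 and it is dropped by the threshold, while "flood" is kept.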
Step 104: cluster the focus webpages according to their web page feature vectors to obtain corresponding potential much-talked-about topic classes;
High user attention is a key characteristic of a much-talked-about topic, so the embodiment of the present invention uses clustering to obtain potential much-talked-about topic classes with high user attention. It should be noted that a potential much-talked-about topic class may or may not turn out to be hot; this is judged later in the detection flow.
Clustering can be described as the process of dividing a set of physical or abstract objects into multiple classes composed of similar objects. A cluster produced by clustering is a set of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters.
Traditional clustering methods include partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, the transitive closure method, the Boolean matrix method, direct clustering, correlation analysis clustering, and statistics-based clustering.
In a preferred embodiment of the present invention, the K-means (K-Means) clustering method, a partitioning method, can be used. The basic idea of K-means clustering is to accept an input K and then divide the n data objects into K clusters such that objects in the same cluster are highly similar and objects in different clusters are not.
In one application example of the present invention, K-means clustering may proceed as follows: first select K of the focus webpages as the centers of K initial clusters; assign each of the other focus webpages to the initial cluster whose center it is most similar to; then recompute the center of each new cluster (the mean of all focus webpages in that cluster); and repeat this process until a standard measure function (such as the mean squared error) converges.
In a specific implementation, K can be set by those skilled in the art according to actual needs. The VSM can be used to calculate the similarity sim(D1, D2) between a focus webpage D1 and the center D2 of a cluster; when the similarity exceeds a similarity threshold, the focus webpage can be assigned to that cluster. The embodiment of the present invention places no limit on the specific similarity threshold.
In one application example of the present invention, sim(D1, D2) can be expressed as:
$$\mathrm{sim}(D_1, D_2) = \frac{W(D_1) \cdot W(D_2)}{|W(D_1)| \, |W(D_2)|} \qquad (3)$$
where $W(D_1)$ and $W(D_2)$ are the feature vectors of $D_1$ and $D_2$, $|W(D_1)|$ and $|W(D_2)|$ are the norms (lengths) of those feature vectors, and $W(D_1) \cdot W(D_2)$ is their dot product.
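The K-means procedure together with the cosine similarity of formula (3) can be sketched as below. This is a minimal illustration under stated assumptions: feature vectors are sparse dictionaries, the first K pages serve as initial centers, and the loop stops when assignments no longer change (the text mentions a measure such as mean squared error instead). The names `cosine`, `mean_vector`, and `k_means` are hypothetical.

```python
import math

def cosine(u, v):
    """Formula (3): dot product divided by the product of the norms."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}

def k_means(vectors, k, max_iter=100):
    """Cluster sparse vectors; returns (cluster index per vector, centers)."""
    centers = [dict(v) for v in vectors[:k]]  # assumption: first k pages as seeds
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            max(range(k), key=lambda c: cosine(vec, centers[c]))
            for vec in vectors
        ]
        if new_assignment == assignment:
            break  # assumption: stop once assignments are stable
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = mean_vector(members)
    return assignment, centers

# Hypothetical feature vectors of four focus webpages.
pages = [
    {"flood": 1.0, "rain": 1.0},
    {"election": 1.0, "vote": 1.0},
    {"flood": 2.0, "rain": 1.0},
    {"vote": 3.0},
]
labels, centers = k_means(pages, 2)
```

Here the two flood-related pages end up in one cluster and the two voting-related pages in the other.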
In practice, the number of potential topic classes obtained may be large. To ensure that the potential topic classes are useful, the embodiment of the present invention can screen all the potential much-talked-about topic classes obtained by clustering. For example, the classes can be sorted in descending order of the number of focus webpages they contain, and the top several taken as the final potential much-talked-about topic classes; or the classes containing more focus webpages than a class threshold can be taken as the final potential much-talked-about topic classes; and so on. It will be understood that the embodiment of the present invention places no limit on the specific screening method or class threshold.
Step 105, using described potential much-talked-about topic class as kind of a subclass, newly-increased webpage is carried out increment gather Class;Described newly-increased webpage specifically can be included in gauze page;
Process to newly-increased webpage, can re-start cluster on the whole data set after increase, this Although kind of the method again clustered is simple, but it causes calculating not only for re-executing a cluster On waste, and easily make great majority clustering algorithm based on internal memory efficiency be substantially reduced, therefore This method again clustered typically is not used.
The embodiment of the present invention then uses increment clustering method, and increment clustering method is only to the increasing in data base Amount part data process, and existing cluster result carry out increment type amendment with perfect.And it is right In the process of newly-increased data, can the increase of data one by one, it is also possible to the increase of batch.
In one preferred embodiment of the invention, described using described potential much-talked-about topic class as seed Class, carries out the step of increment cluster, specifically may include that newly-increased webpage
Sub-step S151, the vectorial matter with each potential much-talked-about topic class of web page characteristics of the newly-increased webpage of calculating The similarity of Heart vector;
Sub-step S152, in web page characteristics vector and the matter of certain potential much-talked-about topic class of certain newly-increased webpage When the similarity of Heart vector is more than or equal to the first similarity threshold, this newly-increased webpage is added potential to this Much-talked-about topic class.
In one preferred embodiment of the invention, the centroid vector of described potential much-talked-about topic class can be Process according to the web page characteristics vector weighting of the focus webpage included by potential much-talked-about topic class and obtain, its In, user's visit capacity that weight is this focus webpage of the web page characteristics vector of certain focus webpage and this heat Belonging to some webpage, the ratio of total user's visit capacity of potential much-talked-about topic apoplexy due to endogenous wind all focuses webpage determines.
In a specific implementation, the similarity of sub-step S151 can be computed with formula (3). The first similarity threshold may be set by those skilled in the art according to actual needs; the embodiment of the present invention does not limit its specific value. The weighting may include a weighted average, a moving weighted average, and the like; the embodiment of the present invention does not limit the specific weighting.
It should be noted that the increment clustering method of sub-steps S151 and S152 is only a preferred embodiment and should not be understood as a limitation on the application of the embodiments of the present invention.
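Sub-steps S151 and S152 can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity stands in for formula (3), which is not reproduced in this part of the specification; a page is assigned only to its single best-matching class; and the names `cosine` and `assign_page` are hypothetical. What happens to a page that matches no class is outside the sketch.

```python
import math
from typing import Dict, List, Optional

Vector = Dict[str, float]  # sparse vector: feature word -> weight (assumed layout)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (assumed stand-in for formula (3))."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_page(page: Vector, centroids: List[Vector],
                first_similarity_threshold: float) -> Optional[int]:
    """Return the index of the seed class the page joins, or None."""
    if not centroids:
        return None
    # S151: similarity to the centroid of every potential topic class.
    sims = [cosine(page, c) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    # S152: join the class only if the similarity reaches the threshold.
    return best if sims[best] >= first_similarity_threshold else None
```

After a page joins a class, the centroid of that class would normally be recomputed so that later pages are compared against the updated class.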
In addition, the newly-increased webpages include online webpages so that the embodiment of the present invention has the advantage of online detection; it can be understood that the newly-increased webpages may also include historical webpages.
Step 106: for each potential much-talked-about topic class after increment clustering, determine whether it is a much-talked-about topic class by analyzing its corresponding user attention rate parameters.
A potential much-talked-about topic class may or may not actually be a focus of attention; step 106 therefore judges whether each potential much-talked-about topic class after increment clustering is a much-talked-about topic class.
In the embodiment of the present invention, the user attention rate parameters may specifically include the number of web documents and the quantity of user network behaviors, where the quantity of user network behaviors may specifically include one or more of user visit volume and user search volume.
In a preferred embodiment of the present invention, the step of determining, for each potential much-talked-about topic class after increment clustering, whether it is a much-talked-about topic class by analyzing its corresponding user attention rate parameters may specifically include:
Sub-step S161: when the ratio of the weighted result of the user attention rate parameters of a potential much-talked-about topic class after increment clustering to the weighted result of the user attention rate parameters of all potential much-talked-about topic classes after increment clustering is greater than a first threshold, determine that this potential much-talked-about topic class is a much-talked-about topic class.
The first threshold may be set by those skilled in the art according to actual needs; the embodiment of the present invention does not limit its specific value. The weighting may include a weighted average, a moving weighted average, and the like; the embodiment of the present invention does not limit the specific weighting. When there are multiple user attention rate parameters, the weight of each parameter may be set by those skilled in the art according to actual needs; the embodiment of the present invention does not limit the specific weights.
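The ratio test of sub-step S161 can be sketched in Python as follows. The sketch assumes that the weighted result is a simple weighted sum of the parameters; the parameter names (`web_documents`, `user_visits`), the weights and the function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def weighted_attention(params: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the user attention rate parameters of one class."""
    return sum(weights[name] * value for name, value in params.items())


def hot_topic_classes(classes: Dict[str, Dict[str, float]],
                      weights: Dict[str, float],
                      first_threshold: float) -> List[str]:
    """Return the ids of classes whose share of total attention exceeds the threshold."""
    scores = {cid: weighted_attention(p, weights) for cid, p in classes.items()}
    total = sum(scores.values())
    if total <= 0:
        return []  # no attention recorded at all, so nothing can be judged hot
    return [cid for cid, score in scores.items() if score / total > first_threshold]
```

Because the test is a share of the total, the first threshold is a value between 0 and 1, and the number of classes that can pass it falls as the threshold rises.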
Of course, the determination method of sub-step S161 is only a preferred embodiment; other determination methods are also feasible. For example, all potential much-talked-about topic classes after increment clustering may be sorted in descending order of their user attention rate parameters, and the several top-ranked classes may be selected as much-talked-about topic classes.
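The ranking alternative can be sketched in a few lines. The sketch assumes that each class has already been reduced to a single attention score; `top_k_topic_classes` and `k` are hypothetical names.

```python
from typing import Dict, List


def top_k_topic_classes(scores: Dict[str, float], k: int) -> List[str]:
    """Sort classes by attention score, highest first, and keep the first k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in ranked[:k]]
```

Unlike the ratio test, this variant always returns up to `k` classes, even when overall attention is low.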
In a preferred embodiment of the present invention, the method may further include:
Step S201: for each potential much-talked-about topic class after increment clustering, predict whether it will be a much-talked-about topic class in the next period by analyzing the change of its corresponding user attention rate parameter over past periods.
In practical applications, time may be divided into periods, for example in units of a day, half a day, an hour, or a minute. The change of the user attention rate parameter over past periods may be the change of the parameter in the current period relative to the parameter in the previous period, which can be expressed by the following formula:
$$\text{change of the user attention rate parameter over past periods} = \frac{\text{parameter in the current period} - \text{parameter in the previous period}}{\text{parameter in the previous period}} \tag{4}$$
In an application example of the present invention, if the change over past periods of the user attention rate parameter corresponding to a potential much-talked-about topic class after increment clustering is greater than a third threshold, it can be predicted that this potential much-talked-about topic class will be a much-talked-about topic class in the next period. The third threshold may be set by those skilled in the art according to actual needs; the embodiment of the present invention does not limit its specific value.
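Formula (4) and the third-threshold test can be sketched as follows. The function names are hypothetical, and the handling of a zero value in the previous period (raising an error) is an assumption, since the specification does not address that case.

```python
def relative_change(current: float, previous: float) -> float:
    """Formula (4): change of the parameter relative to the previous period."""
    if previous == 0:
        # Assumption: the specification does not say how to treat this case.
        raise ValueError("previous-period parameter must be non-zero")
    return (current - previous) / previous


def predict_hot_next_period(current: float, previous: float,
                            third_threshold: float) -> bool:
    """Predict a much-talked-about topic class when growth exceeds the threshold."""
    return relative_change(current, previous) > third_threshold
```

For example, a class whose user visit volume rises from 100 to 150 between two periods has a relative change of 0.5 and would be predicted hot under a third threshold of 0.3.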
It should be noted that the prediction scheme of step S201 is only a preferred scheme; in fact, any scheme that predicts the much-talked-about topic classes of the next period from the variation trend of the user attention rate parameter is feasible.
In a preferred embodiment of the present invention, the method may further include:
Prompting the determined or predicted much-talked-about topic classes, where the prompt content may specifically include the description keywords of the corresponding much-talked-about topic class.
In a preferred embodiment of the present invention, the description keywords may specifically include the several feature words with the highest co-occurrence degree across all webpages of the corresponding much-talked-about topic class. The co-occurrence degree of a feature word can be represented by the number of webpages in which this feature word appears; the number of feature words selected can be set by those skilled in the art according to actual demand.
Further, if many feature words have a high co-occurrence degree across all webpages of a much-talked-about topic class, these feature words can be further screened in descending order of feature word weight, where the weight of a feature word can be obtained with formula (1).
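The keyword selection above can be sketched in Python. The sketch assumes that each webpage is given as the set of feature words it contains and that feature word weights are supplied precomputed (formula (1) is not reproduced here). Using the weight as a tie-breaker among equally frequent words is a simplification of the screening step, and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def description_keywords(pages: List[Set[str]], n: int,
                         word_weight: Dict[str, float]) -> List[str]:
    """Pick the n feature words that occur in the most webpages of a class."""
    # Co-occurrence degree = number of webpages containing the feature word.
    cooccurrence = Counter(word for page in pages for word in page)
    ranked = sorted(
        cooccurrence,
        # Higher co-occurrence first, then higher weight, then alphabetical.
        key=lambda w: (-cooccurrence[w], -word_weight.get(w, 0.0), w),
    )
    return ranked[:n]
```

Because each page is a set, a word repeated many times within one page still counts once toward its co-occurrence degree.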
Corresponding to the foregoing method embodiment, the embodiment of the present invention also discloses a topic detection device based on big data. Referring to the structure diagram shown in Fig. 2, the device may specifically include:
Abstraction module 201, for extracting focus webpages according to user network behavior data;
Acquisition module 202, for collecting the content of the focus webpages;
Extraction module 203, for extracting the web page characteristics vectors of the focus webpages according to the content of the focus webpages;
Cluster module 204, for clustering the focus webpages according to their web page characteristics vectors to obtain corresponding potential much-talked-about topic classes;
Increment cluster module 205, for performing increment clustering on newly-increased webpages with the potential much-talked-about topic classes as seed classes, the newly-increased webpages including online webpages; and
Determination module 206, for determining, for each potential much-talked-about topic class after increment clustering, whether it is a much-talked-about topic class by analyzing its corresponding user attention rate parameters.
In a preferred embodiment of the present invention, the user network behavior data may specifically include one or more of user access behavior data and user search behavior data;
The abstraction module 201 may then specifically include:
First extraction submodule, for obtaining, according to the user access behavior data, webpages whose user visit volume or user access frequency meets a first preset condition, as focus webpages; and/or
Second extraction submodule, for obtaining, according to the user search behavior data, webpages associated with keywords whose user search volume or user search frequency meets a second preset condition, as focus webpages.
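The two extraction submodules can be sketched as follows. The sketch assumes that both preset conditions are simple minimum-count thresholds, that the access log is a sequence of visited URLs, that the search log is a sequence of query keywords, and that a keyword-to-webpage association is already available; all names and data layouts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def focus_pages_by_visits(access_log: Iterable[str], min_visits: int) -> Set[str]:
    """First submodule: webpages whose visit volume reaches the preset minimum."""
    counts = Counter(access_log)
    return {url for url, n in counts.items() if n >= min_visits}


def focus_pages_by_search(search_log: Iterable[str],
                          keyword_pages: Dict[str, List[str]],
                          min_searches: int) -> Set[str]:
    """Second submodule: webpages associated with frequently searched keywords."""
    counts = Counter(search_log)
    focus: Set[str] = set()
    for keyword, n in counts.items():
        if n >= min_searches:
            focus.update(keyword_pages.get(keyword, []))
    return focus
```

When both kinds of behavior data are available, the union of the two result sets gives the focus webpages passed on to the acquisition module.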
In another preferred embodiment of the present invention, the determination module includes:
Weighting decision submodule, for determining that a potential much-talked-about topic class after increment clustering is a much-talked-about topic class when the ratio of the weighted result of its corresponding user attention rate parameters to the weighted result of the user attention rate parameters of all potential much-talked-about topic classes after increment clustering is greater than a first threshold.
In yet another preferred embodiment of the present invention, the increment cluster module 205 may specifically include:
Similarity calculation submodule, for calculating the similarity between the web page characteristics vector of a newly-increased webpage and the centroid vector of each potential much-talked-about topic class;
Comparison submodule, for adding a newly-increased webpage to a potential much-talked-about topic class when the similarity between the web page characteristics vector of this newly-increased webpage and the centroid vector of that class is greater than or equal to a first similarity threshold.
In a preferred embodiment of the present invention, the centroid vector of a potential much-talked-about topic class may be obtained by weighting the web page characteristics vectors of the focus webpages included in that class, where the weight of the web page characteristics vector of a focus webpage may be determined by the ratio of the user visit volume of this focus webpage to the total user visit volume of all focus webpages in the potential much-talked-about topic class to which it belongs.
In another preferred embodiment of the present invention, the device may further include:
Prediction module, for predicting, for each potential much-talked-about topic class after increment clustering, whether it will be a much-talked-about topic class in the next period by analyzing the change of its corresponding user attention rate parameter over past periods.
In the embodiment of the present invention, preferably, the device may further include:
Prompt module, for prompting the determined or predicted much-talked-about topic classes, where the prompt content includes the description keywords of the corresponding much-talked-about topic class.
In a preferred embodiment of the present invention, the description keywords may specifically include the several feature words with the highest co-occurrence degree across all webpages of the corresponding much-talked-about topic class.
In a preferred embodiment of the present invention, the user attention rate parameters may specifically include the number of web documents and the quantity of user network behaviors.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, see the description of the method embodiment.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
The topic detection method and device based on big data provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (8)

1. A topic detection method based on big data, characterized by comprising:
extracting focus webpages according to user network behavior data;
collecting the content of the focus webpages;
extracting the web page characteristics vectors of the focus webpages according to the content of the focus webpages;
wherein the step of extracting the web page characteristics vectors of the focus webpages according to the content of the focus webpages comprises: extracting web page content features of the focus webpages according to the collected webpage content; and constructing the web page characteristics vectors of the focus webpages according to the web page content features;
clustering the focus webpages according to their web page characteristics vectors to obtain corresponding potential much-talked-about topic classes;
performing increment clustering on newly-increased webpages with the potential much-talked-about topic classes as seed classes, the newly-increased webpages including online webpages;
for each potential much-talked-about topic class after increment clustering, determining whether it is a much-talked-about topic class by analyzing its corresponding user attention rate parameters;
wherein the step of determining, for each potential much-talked-about topic class after increment clustering, whether it is a much-talked-about topic class by analyzing its corresponding user attention rate parameters comprises: when the ratio of the weighted result of the user attention rate parameters of a potential much-talked-about topic class after increment clustering to the weighted result of the user attention rate parameters of all potential much-talked-about topic classes after increment clustering is greater than a first threshold, determining that this potential much-talked-about topic class is a much-talked-about topic class;
wherein the user network behavior data includes one or more of user access behavior data and user search behavior data;
and the step of extracting focus webpages according to user network behavior data comprises:
obtaining, according to the user access behavior data, webpages whose user visit volume or user access frequency meets a first preset condition, as focus webpages; and/or
obtaining, according to the user search behavior data, webpages associated with keywords whose user search volume or user search frequency meets a second preset condition, as focus webpages.
2. The method of claim 1, characterized in that the step of performing increment clustering on newly-increased webpages with the potential much-talked-about topic classes as seed classes comprises:
calculating the similarity between the web page characteristics vector of a newly-increased webpage and the centroid vector of each potential much-talked-about topic class;
when the similarity between the web page characteristics vector of a newly-increased webpage and the centroid vector of a potential much-talked-about topic class is greater than or equal to a first similarity threshold, adding this newly-increased webpage to that potential much-talked-about topic class.
3. The method of claim 2, characterized in that the centroid vector of a potential much-talked-about topic class is obtained by weighting the web page characteristics vectors of the focus webpages included in that class, wherein the weight of the web page characteristics vector of a focus webpage is determined by the ratio of the user visit volume of this focus webpage to the total user visit volume of all focus webpages in the potential much-talked-about topic class to which it belongs.
4. The method of claim 1, characterized by further comprising:
for each potential much-talked-about topic class after increment clustering, predicting whether it will be a much-talked-about topic class in the next period by analyzing the change of its corresponding user attention rate parameter over past periods.
5. The method according to any one of claims 1 to 4, characterized by further comprising:
prompting the determined or predicted much-talked-about topic classes, wherein the prompt content includes the description keywords of the corresponding much-talked-about topic class.
6. The method of claim 5, characterized in that the description keywords include the several feature words with the highest co-occurrence degree across all webpages of the corresponding much-talked-about topic class.
7. The method according to any one of claims 1 to 4, characterized in that the user attention rate parameters include the number of web documents and the quantity of user network behaviors.
8. A topic detection device based on big data, characterized by comprising:
an abstraction module, for extracting focus webpages according to user network behavior data, wherein the user network behavior data includes one or more of user access behavior data and user search behavior data;
an acquisition module, for collecting the content of the focus webpages;
an extraction module, for extracting the web page characteristics vectors of the focus webpages according to the content of the focus webpages, wherein extracting the web page characteristics vectors of the focus webpages according to the content of the focus webpages comprises: extracting web page content features of the focus webpages according to the collected webpage content; and constructing the web page characteristics vectors of the focus webpages according to the web page content features;
a cluster module, for clustering the focus webpages according to their web page characteristics vectors to obtain corresponding potential much-talked-about topic classes;
an increment cluster module, for performing increment clustering on newly-increased webpages with the potential much-talked-about topic classes as seed classes, the newly-increased webpages including online webpages; and
a determination module, for determining, for each potential much-talked-about topic class after increment clustering, whether it is a much-talked-about topic class by analyzing its corresponding user attention rate parameters;
wherein the determination module includes: a weighting decision submodule, for determining that a potential much-talked-about topic class after increment clustering is a much-talked-about topic class when the ratio of the weighted result of its corresponding user attention rate parameters to the weighted result of the user attention rate parameters of all potential much-talked-about topic classes after increment clustering is greater than a first threshold;
and the increment cluster module includes: a similarity calculation submodule, for calculating the similarity between the web page characteristics vector of a newly-increased webpage and the centroid vector of each potential much-talked-about topic class;
and a comparison submodule, for adding a newly-increased webpage to a potential much-talked-about topic class when the similarity between the web page characteristics vector of this newly-increased webpage and the centroid vector of that class is greater than or equal to a first similarity threshold.
CN201310075129.4A 2013-03-08 2013-03-08 A kind of topic detection method and device based on big data Active CN103177090B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310075129.4A CN103177090B (en) 2013-03-08 2013-03-08 A kind of topic detection method and device based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310075129.4A CN103177090B (en) 2013-03-08 2013-03-08 A kind of topic detection method and device based on big data

Publications (2)

Publication Number Publication Date
CN103177090A CN103177090A (en) 2013-06-26
CN103177090B true CN103177090B (en) 2016-11-23

Family

ID=48636951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310075129.4A Active CN103177090B (en) 2013-03-08 2013-03-08 A kind of topic detection method and device based on big data

Country Status (1)

Country Link
CN (1) CN103177090B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461842B (en) * 2013-09-23 2018-02-16 伊姆西公司 Based on daily record similitude come the method and apparatus of handling failure
CN104063428A (en) * 2014-06-09 2014-09-24 国家计算机网络与信息安全管理中心 Method for detecting unexpected hot topics in Chinese microblogs
CN104486461B (en) * 2014-12-29 2019-04-19 北京奇安信科技有限公司 Domain name classification method and device, domain name recognition methods and system
CN104933622A (en) * 2015-03-12 2015-09-23 中国科学院计算技术研究所 Microblog popularity degree prediction method based on user and microblog theme and microblog popularity degree prediction system based on user and microblog theme
CN104850606B (en) * 2015-05-03 2019-03-26 西北工业大学 Method for summarizing social events in mobile crowd sensing
CN106874292B (en) * 2015-12-11 2020-05-05 北京国双科技有限公司 Topic processing method and device
CN106874299A (en) * 2015-12-14 2017-06-20 北京国双科技有限公司 Page detection method and device
CN106021425A (en) * 2016-05-13 2016-10-12 北京奇虎科技有限公司 Hot news mining method and device
CN106130756B (en) * 2016-06-15 2019-06-14 晶赞广告(上海)有限公司 A kind of method and device of prediction access content clicking rate
CN106126632A (en) * 2016-06-22 2016-11-16 北京小米移动软件有限公司 Recommend method and device
CN106354846A (en) * 2016-08-31 2017-01-25 成都广电视讯文化传播有限公司 Intelligent news manuscript selection method and system based on big data
CN108228602A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 The sorting technique and device of website
CN108512873B (en) * 2017-02-27 2020-02-04 中国科学院沈阳自动化研究所 Packet semantic message filtering and routing method of distributed self-organizing structure
CN107103043A (en) * 2017-03-29 2017-08-29 国信优易数据有限公司 A kind of Text Clustering Method and system
CN107784127A (en) * 2017-11-30 2018-03-09 杭州数梦工场科技有限公司 A kind of focus localization method and device
CN107944931A (en) * 2017-12-18 2018-04-20 平安科技(深圳)有限公司 Seed user expanding method, electronic equipment and computer-readable recording medium
CN108255978A (en) * 2017-12-28 2018-07-06 曙光信息产业(北京)有限公司 The method and system of Press release topic cluster
CN109190003B (en) * 2018-08-20 2021-03-02 上海蜜度信息技术有限公司 Method and apparatus for determining list page nodes
CN109408639B (en) * 2018-10-31 2022-05-31 广州虎牙科技有限公司 Bullet screen classification method, bullet screen classification device, bullet screen classification equipment and storage medium
CN111026990B (en) * 2019-12-05 2024-04-16 中国银行股份有限公司 Hot topic log information display method and device
CN111339784B (en) * 2020-03-06 2023-03-14 支付宝(杭州)信息技术有限公司 Automatic new topic mining method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231640A (en) * 2007-01-22 2008-07-30 北大方正集团有限公司 Method and system for automatically computing subject evolution trend in the internet
CN101408898A (en) * 2008-11-07 2009-04-15 北大方正集团有限公司 Method and device for extracting web page text
CN101488150A (en) * 2009-03-04 2009-07-22 哈尔滨工程大学 Real-time multi-view network focus event analysis apparatus and analysis method
CN102194001A (en) * 2011-05-17 2011-09-21 杭州电子科技大学 Internet public opinion crisis early-warning method
CN102708153A (en) * 2012-04-18 2012-10-03 中国信息安全测评中心 Self-adaption finding and predicting method and system for hot topics of online social network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676465B2 (en) * 2006-07-05 2010-03-09 Yahoo! Inc. Techniques for clustering structurally similar web pages based on page features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
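Zhang Zhushan. Hot topic detection in online forums based on cluster analysis (基于聚类分析的网络论坛热点话题检测). China Master's Theses Full-text Database, Information Science and Technology series. 2011, *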

Also Published As

Publication number Publication date
CN103177090A (en) 2013-06-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170817

Address after: 834000, cloud computing industry park, Xinjiang, Karamay A-00027

Patentee after: Karamay Silk Road Digital Technology Co., Ltd.

Address before: 100081, Haidian District, Beijing South Street, northeast flourishing, Beijing Zhongguancun software incubator, building 1, block C, three, 1322-D

Patentee before: IZP (Beijing) Technologies Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20201230

Address after: No. a9-9009, floor 1, No. 28, information road, Haidian District, Beijing

Patentee after: IZP (BEIJING) TECHNOLOGIES Co.,Ltd.

Address before: No. a-00027, cloud computing Industrial Park, Karamay, Xinjiang 834000

Patentee before: Karamay Silk Road Digital Technology Co.,Ltd.