CN103853763A - Information acquiring method and device

Info

Publication number: CN103853763A
Application number: CN201210509047.1A
Authority: CN
Inventors: 程刚; 潘璇; 庄子明; 李鹤; 王谷丹; 周霄骁; 刘新鸣; 芦方
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2012-12-03
Filing date: 2012-12-03
Publication date: 2014-06-11
Anticipated expiration: 2032-12-03
Also published as: CN103853763B

Abstract

The invention discloses a method and a device for obtaining information, belonging to the field of information technology. The method includes: obtaining relevant information of a designated user in the current time period and preprocessing it to obtain the related words of the relevant information, where the relevant information includes information posted or forwarded by the designated user; determining the attention values of the related words; obtaining the keywords in the relevant information according to the attention values of the related words; and obtaining the information of interest to the designated user according to the obtained keywords.

Description

Method and apparatus for obtaining information

Technical field

The present invention relates to the field of microblog technology, and in particular to a method and apparatus for obtaining information.

Background

A microblog is a platform for sharing, spreading, and obtaining information based on user relationships. Users can build personal communities through WEB, WAP, and various clients, post updates of about 140 characters, and share them instantly. On a microblog, a user can act as a reader who browses information of interest, or as a publisher who posts content for others to browse. The most prominent feature of microblogs is that information is published quickly and spreads quickly. Because of this, more and more users join microblog platforms, including stars and other celebrities, and microblog platforms provide celebrity microblogs for this special group. Because a celebrity may have a large audience, the topics followed in a celebrity microblog are likely to become hot information, so how to quickly mine the information in celebrity microblogs is a problem that needs to be solved.

Summary of the invention

In order to quickly mine the information in celebrity microblogs, embodiments of the present invention provide a method and apparatus for obtaining information. The technical solution is as follows:

In one aspect, a method for obtaining information is provided, the method comprising:

obtaining relevant information of a designated user in the current time period, and preprocessing the relevant information to obtain the related words of the relevant information, wherein the relevant information comprises relevant information posted or forwarded by the designated user;

determining the attention values of the related words of the relevant information;

obtaining the keywords in the relevant information according to the attention values of the related words;

obtaining, according to the obtained keywords, the information of interest to the designated user.

In another aspect, a device for obtaining information is provided, the device comprising:

a preprocessing module, configured to obtain relevant information of a designated user in the current time period and to preprocess the relevant information to obtain the related words of the relevant information, wherein the relevant information comprises relevant information posted or forwarded by the designated user;

a determination module, configured to determine the attention values of the related words of the relevant information;

a first acquisition module, configured to obtain the keywords in the relevant information according to the attention values of the related words;

a second acquisition module, configured to obtain, according to the obtained keywords, the information of interest to the designated user.

The beneficial effect of the technical solution provided by the embodiments of the present invention is as follows: relevant information of a designated user in the current time period is obtained and preprocessed to obtain its related words, wherein the relevant information comprises relevant information posted or forwarded by the designated user; the attention values of the related words are determined; the keywords in the relevant information are obtained according to the attention values of the related words; and the information of interest to the designated user is obtained according to the obtained keywords. The information in celebrity microblogs can thereby be mined quickly.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for obtaining information provided in Embodiment 1 of the present invention;

Fig. 2 is a flowchart of a method for obtaining information provided in Embodiment 2 of the present invention;

Fig. 3 is a schematic structural diagram of a device for obtaining information provided in Embodiment 3 of the present invention;

Fig. 4 is a schematic structural diagram of another device for obtaining information provided in Embodiment 3 of the present invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Embodiment 1

Referring to Fig. 1, this embodiment provides a method for obtaining information, comprising:

101. Obtain relevant information of a designated user in the current time period, and preprocess the relevant information to obtain the related words of the relevant information, wherein the relevant information comprises relevant information posted or forwarded by the designated user;

102. Determine the attention values of the related words of the relevant information;

103. Obtain the keywords in the relevant information according to the attention values of the related words;

104. Obtain, according to the obtained keywords, the information of interest to the designated user.

The beneficial effect of this embodiment is as follows: relevant information of a designated user in the current time period is obtained and preprocessed to obtain its related words, wherein the relevant information comprises relevant information posted or forwarded by the designated user; the attention values of the related words are determined; the keywords in the relevant information are obtained according to the attention values of the related words; and the information of interest to the designated user is obtained according to the obtained keywords. The information in celebrity microblogs can thereby be mined quickly.

Embodiment 2

This embodiment of the present invention provides a method for obtaining information. Referring to Fig. 2, the method flow comprises:

201. Obtain the relevant information of a designated user in the current time period.

In this embodiment, designated users include stars and other celebrities who have completed celebrity verification on the microblog. The current time period refers to a period of time ending at the current moment; for example, if the current moment is 11:00, the current time period may be 10:00 to 11:00, or 10:30 to 11:00. The current time period may be measured in hours or in other units, which is not specifically limited in this embodiment. The relevant information comprises relevant information posted or forwarded by the designated user, for example blog posts that the user posts or forwards, which is not specifically limited in this embodiment.

202. Preprocess the relevant information to obtain the related words of the relevant information.

In this embodiment, preprocessing the relevant information to obtain its related words comprises: preprocessing the relevant information to remove punctuation marks, invisible characters, and characters displayed as garbled text; performing word segmentation on the preprocessed relevant information; and matching the segmented relevant information against a preset vocabulary to filter out the specified words, thereby obtaining the related words of the relevant information.

In this embodiment, preprocessing retains digits, Chinese characters, English, and the written languages of other countries, and removes punctuation marks, "Martian script", radicals, and other invisible or garbled characters. Specified words include but are not limited to: adverbs, modal particles, profanity, pornographic words, politically sensitive words, and other meaningless common words. Word segmentation refers to Chinese word segmentation, which cuts a sequence of Chinese characters into individual words. A vocabulary containing the specified words is set up in advance; after the relevant information is segmented, it is matched against this preset vocabulary and the specified words are deleted, yielding the related words of the relevant information. The word classes of related words include: adjectives, distinguishing words, nouns, person names, place names, organization names, other proper nouns, locative words, time words, verbs, gerunds, and so on.
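The preprocessing in step 202 can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `SPECIFIED_WORDS` is a hypothetical stand-in for the preset vocabulary of specified words, and the default whitespace `segment` function is a placeholder for a real Chinese word segmenter, which the embodiment does not name.

```python
import re

# Hypothetical stand-in for the preset vocabulary of specified words
# (adverbs, modal particles, profanity, and so on).
SPECIFIED_WORDS = {"very", "the", "oh"}

def preprocess(text, segment=str.split):
    """Return the related words of one post.

    `segment` is an assumed placeholder for a Chinese word segmenter;
    by default the text is simply split on whitespace.
    """
    # Keep letters, digits, and CJK characters; replace punctuation,
    # invisible characters, and other symbols with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text).replace("_", " ")
    tokens = segment(cleaned)
    # Filter out the specified words.
    return [t for t in tokens if t.lower() not in SPECIFIED_WORDS]
```

In practice the segmenter and the vocabulary would be supplied by the deployment; the filtering step itself is just a set lookup.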

203. Determine the attention values of the related words of the relevant information, and obtain the keywords in the relevant information according to the attention values of the related words.

A keyword in this embodiment refers to a word that is popular on the network. As a lexical phenomenon, keywords reflect the issues and things of common concern to the people of a country or region during a certain period; they are time-dependent and reflect the hot topics, livelihood issues, and so on of a specific period.

In this embodiment, to extract keywords, the attention values of the related words obtained after word segmentation are calculated first, and the keywords are then extracted according to preset rules. Determining the attention values of the related words of the relevant information comprises: determining the frequency of occurrence of the related word in the current time period and the historical frequency of occurrence of the related word; obtaining the relative change rate of the related word from these two frequencies; and obtaining the attention value of the related word according to its relative change rate.

The frequency of occurrence of a related word in the current time period is: qv = (total number of occurrences of the related word in the current time period) / (total number of items of microblog relevant information in the current time period).

In this embodiment, to reduce the differences between periods of the day and between days of the week, the historical frequency of occurrence of a related word is obtained from a first frequency of occurrence in the same time period (H_score), a second frequency of occurrence on the same day of the week (W_score), and a third frequency of occurrence over whole days (A_score) within a preset period, that is, hist = α * H_score + β * W_score + γ * A_score, where α, β, and γ are coefficients and α + β + γ = 1.

The same time period refers to the hour interval identical to the current time period on the days of the preset period before the current time period. For example, if the current time period is 10:00 to 11:00, the same time period refers to the data from 10:00 to 11:00 on each of the previous 1 to n days, where n is the length of the preset period and may be 1 day, 2 days, 3 days, and so on. h_qv(i) is the frequency of occurrence of the word i days ago in the hour interval identical to the current time period.

The same day of the week refers to the dates within the preset period that differ from the current time period by a multiple of seven days, in the hour interval identical to the current time period, that is, 7*n days ago (n = 1, 2, 3, ...). qv(7*i) is the frequency of occurrence of the word on the day 7*i days ago.

The whole day refers to the whole-day values within the preset period, that is, the values from 1 to n days ago (n = 1, 2, 3, ...). qv(i) is the frequency of occurrence of the word on the day i days ago.

The historical frequency of occurrence of related word k is:

Hist(k) = α * (Σ_{i=1}^{n} h_qv(i)) / n + β * (Σ_{i=1}^{n/7} qv(7*i)) / n + γ * (Σ_{i=1}^{n/7} qv(i)) / n.

From the frequency of occurrence qv(k) of related word k in the current time period and its historical frequency of occurrence Hist(k), the relative change rate of related word k is calculated as Hot_score(K) = qv(k) / Hist(k).
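The frequency and change-rate calculations above can be illustrated with a short Python sketch. The function names, the default coefficient values, and the `floor` guard against a zero historical frequency are assumptions added for illustration; the embodiment only requires that α + β + γ = 1 and does not say how a word with no history is handled.

```python
def current_frequency(occurrences, post_count):
    """qv: occurrences of the word divided by the number of posts in the period."""
    return occurrences / post_count

def historical_frequency(h_qv, w_qv, a_qv, n, alpha=0.5, beta=0.3, gamma=0.2):
    """Hist(k) as a weighted combination of three historical series.

    h_qv: same-hour frequencies on previous days
    w_qv: frequencies on the same weekday in previous weeks
    a_qv: whole-day frequencies on previous days
    n:    length of the preset period in days
    The default coefficients are illustrative values only.
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return (alpha * sum(h_qv) + beta * sum(w_qv) + gamma * sum(a_qv)) / n

def hot_score(qv, hist, floor=1e-9):
    """Relative change rate qv / Hist; `floor` is an assumed guard for hist == 0."""
    return qv / max(hist, floor)
```

A word whose current frequency is several times its historical frequency receives a correspondingly large relative change rate.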

In this embodiment, after the relative change rate of a related word is obtained, one way of obtaining the attention value of the related word according to its relative change rate comprises:

obtaining, according to the relative change rate of the related word, the historical frequency with which the related word was not a keyword and the historical frequency with which the related word was a keyword;

obtaining the attention value of the related word according to the relative change rate of the related word, the historical frequency with which it was not a keyword, and the historical frequency with which it was a keyword.

To help those skilled in the art understand more clearly this first way of obtaining the attention value of a related word according to its relative change rate, it is explained as follows:

If word K was not a keyword on d-1 of the n-1 historical days, the data of those (d-1) days are taken and, following the scheme for calculating Hist(k) described above, the historical frequency of not being a keyword, Hist(k)_{d-1}, and the historical frequency of being a keyword, Hist(k)_{n-d}, are obtained. According to the binomial (Bernoulli) distribution, the probability that the word is not a keyword over the n days is:

P(ξ=n)_normal = C(n, d) * (Hist(k)_d)^d * (1-p)^(n-d), where C(n, d) = n! / (d! * (n-d)!);

and the probability that the word is a keyword over the n days is:

P(ξ=n)_hot = C(n, n-d) * (Hist(k)_{n-d})^(n-d) * (1-p)^d, where C(n, n-d) = n! / ((n-d)! * d!);

The attention value of the related word is then:

Final_score(K) = Hot_score(K) * P(ξ=n)_normal / P(ξ=n)_hot.

The larger Final_score is, the higher the attention value of the word. In this embodiment, further, the related words with higher attention values are taken as keywords.

Optionally, in this embodiment, after the relative change rate of a related word is obtained, another way of obtaining the attention value of the related word according to its relative change rate comprises: performing two-class discrimination on the relative change rate of the related word with the neuron nonlinear activation function sigmoid, to obtain the attention value of the related word.

To help those skilled in the art understand more clearly this second way of obtaining the attention value of a related word according to its relative change rate, it is explained as follows:

Two-class discrimination is performed on the relative change rate of related word k with the sigmoid function:

F(x) = 1 / (1 + e^{-τx}), where x = Hot_score(K) and τ is a parameter greater than 0 and less than 1, for example τ = 0.01, 0.05, ... The specific value of τ can be tuned according to the data source so that the values after the sigmoid tend to be distributed over (0, 1). Thus Final_score(K) = 1 / (1 + e^{-τ * Hot_score(K)}).

After the attention value of each related word is obtained, the related words whose Final_score(K) is greater than η are taken as keywords, where η ∈ (0.5, 1). The value of η can be obtained from experimental data or experience, which is not specifically limited in this embodiment.
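The second way can be sketched in a few lines of Python. The sketch assumes the standard logistic form 1 / (1 + e^{-τx}), which maps non-negative change rates into [0.5, 1) and is therefore consistent with a threshold η in (0.5, 1); the default values of `tau` and `eta` are illustrative, not values given by the embodiment.

```python
import math

def final_score(hot, tau=0.05):
    """Squash a relative change rate with the logistic function (assumed form)."""
    return 1.0 / (1.0 + math.exp(-tau * hot))

def select_keywords(hot_scores, tau=0.05, eta=0.6):
    """Return the words whose squashed score exceeds the threshold eta.

    hot_scores maps each related word to its Hot_score.
    """
    return [word for word, hot in hot_scores.items() if final_score(hot, tau) > eta]
```

A smaller `tau` flattens the curve, so larger change rates are needed before a word crosses the threshold.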

Further, in this embodiment, obtaining the keywords in the relevant information of the designated user according to the attention values of the related words comprises: comparing the attention value of each related word with a first preset threshold, and taking the related words whose attention values are greater than the first preset threshold as keywords. For the first way of determining the attention value, a first preset threshold can be set and the related words whose attention values exceed it are taken as keywords; for the second way, η is the first preset threshold, and the related words whose Final_score(K) is greater than η are taken as keywords.

It is worth noting that, as can be seen from the second way of determining the attention value, if the selected time range is too short, for example if word K was a keyword throughout the historical n days but is now on a declining trend, the method cannot identify keyword K. The time range used by the second way therefore needs to be long enough.

204. Obtain the information of interest to the designated user according to the obtained keywords in the relevant information of the designated user.

In this embodiment, after the keywords in the relevant information of the designated user are obtained, readable expansion is performed on the keywords to obtain the information of interest to the user, which includes but is not limited to hot topics on the network or on the microblog. Specifically, one way of obtaining the information of interest to the designated user according to the obtained keywords comprises:

matching the keywords with preset topics to find the keywords that can be matched with a preset topic;

determining the score between the preset topic and those keywords that can be matched with it;

obtaining the information of interest to the designated user according to that score.

To help those skilled in the art understand more clearly this first way of obtaining the information of interest to the designated user according to the obtained keywords, it is explained as follows:

The preset topics include but are not limited to: search query words, processed news headlines (processing includes length processing, that is, shortening), microblog topics, and so on.

The preset topics are matched against the keywords to find the keywords that can be matched with a preset topic (Match). Optionally, the keywords that cannot be matched with the preset topic (lost) and the other words remaining in the preset topic after the keywords are removed (surplus) can also be found, and the scores of these keywords against the preset topic are determined respectively. For example, suppose the existing keyword set about a certain hot spot is {"Harbin", "Yang Mingtan", "bridge"}, and there is a topic described as "Yang Mingtan bridge collapses". The topic contains "Yang Mingtan" and "bridge"; these two words are the words that can be matched with the topic and are called Match. The topic does not contain the keyword "Harbin"; this word is called lost. If "collapse" is also a keyword, the keyword set contained in "Yang Mingtan bridge collapses" is {"Yang Mingtan", "bridge", "collapse"}, so "collapse" is an extra word compared with the existing keyword set {"Harbin", "Yang Mingtan", "bridge"}; it is the surplus described here. Here,

Match_score(topic_k) = Σ_{i=1}^{n} match(Hot_score(i)),

where topic_k denotes the k-th preset topic, match(Hot_score(i)) denotes selecting from Hot_score(i) the attention values of the keywords that can be matched with the preset topic, and Match_score(topic_k) denotes the score between the matchable keywords and the topic;

lost_score(topic_k) = - Σ_{i=1}^{n} lost(Hot_score(i)),

where lost(Hot_score(i)) denotes selecting from Hot_score(i), by comparing the preset topic with the keywords of the clustered topic, the attention values of the clustered keywords that cannot be matched with the preset topic, and lost_score(topic_k) denotes the score between the keywords that cannot be matched with the preset topic and the preset topic;

surplus_score(topic_k) = - Σ_{i=1}^{n} surplus(Hot_score(i)),

where surplus(Hot_score(i)) denotes selecting from Hot_score(i), by comparing the preset topic with the keywords of the clustered topic, the scores of the surplus keywords contained in the preset topic, and surplus_score(topic_k) denotes the score calculated from the surplus keywords in the preset topic;

The score of the information of interest to the user is topic_score(k) = α * Match_score(topic_k) + β * lost_score(topic_k) + γ * surplus_score(topic_k), where α, β, and γ are coefficients and α + β + γ = 1.

It is worth noting that the microblogs may or may not be clustered by computing their similarity from the keywords. lost_score(topic_k) and surplus_score(topic_k) in the above formula need to be calculated only when the topics have first been clustered; without clustering, the scores lost_score and surplus_score need not be calculated, in which case α = 1, β = 0, γ = 0, and topic_score(k) = Match_score(topic_k).
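The Match, lost, and surplus scoring can be sketched as follows in Python. The function name, the use of sets for the keyword collections, and the default coefficients are assumptions for illustration; `attention` is a hypothetical mapping from each keyword to its attention value.

```python
def topic_score(cluster_keywords, topic_words, attention,
                alpha=0.6, beta=0.2, gamma=0.2):
    """Score one preset topic against the keyword set of one hot spot.

    cluster_keywords: set of keywords of the hot spot
    topic_words:      set of words appearing in the preset topic
    attention:        dict mapping keyword -> attention value
    """
    matched = cluster_keywords & topic_words   # Match: keywords found in the topic
    lost = cluster_keywords - topic_words      # lost: keywords missing from the topic
    # surplus: words of the topic that are keywords but not in the hot spot's set
    surplus = {w for w in topic_words - cluster_keywords if w in attention}

    match_score = sum(attention[w] for w in matched)
    lost_score = -sum(attention[w] for w in lost)
    surplus_score = -sum(attention[w] for w in surplus)
    return alpha * match_score + beta * lost_score + gamma * surplus_score
```

Calling the function with `alpha=1.0, beta=0.0, gamma=0.0` reproduces the unclustered case, where only the matched keywords contribute.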

Optionally, in this embodiment, another way of obtaining the information of interest to the designated user according to the obtained keywords in the relevant information of the designated user comprises:

clustering the relevant information of the microblogs of the designated user according to the keywords, so that microblogs with high keyword similarity are placed in the same class;

determining the common subsets of the relevant information of the clustered microblogs, wherein each common subset contains the keywords and the length of each common subset is less than or equal to a second preset threshold;

determining the score of each common subset according to the attention values of the keywords;

obtaining the information of interest to the designated user according to the scores of the common subsets.

Clustering the relevant information of the microblogs according to the keywords means placing microblogs with high keyword similarity in the same class. The common subsets of the clustered microblogs are then calculated; a common subset must contain keywords and its length must not exceed a certain limit, namely the second preset threshold. The score of a common subset is the sum of the scores of its keywords. Preferably, the calculated common subsets are sorted by score, and the top-ranked common subsets are taken as the information of interest to the user. The relevant information of the microblogs to be clustered is the relevant information posted or forwarded by the designated user within a specified time, where the specified time may be the current time period, the last several hours, the last day, and so on, which is not specifically limited in this embodiment.
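The scoring and ranking of common subsets described above can be sketched in Python as follows. The sketch assumes that the candidate common subsets have already been extracted as strings, that a keyword counts as contained when it appears as a substring, and that length is measured in characters; none of these details is fixed by the embodiment, and the function names are hypothetical.

```python
def subset_score(candidate, attention):
    """Sum the attention values of the keywords contained in the candidate."""
    return sum(value for keyword, value in attention.items() if keyword in candidate)

def best_subsets(candidates, attention, max_len, top=1):
    """Rank candidates that contain a keyword and respect the length limit."""
    eligible = [c for c in candidates
                if len(c) <= max_len and subset_score(c, attention) > 0]
    eligible.sort(key=lambda c: subset_score(c, attention), reverse=True)
    return eligible[:top]
```

Here `max_len` plays the role of the second preset threshold, and `top` controls how many of the highest-scoring common subsets are reported.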

To help those skilled in the art understand more clearly this second way of obtaining the information of interest to the designated user according to the obtained keywords, it is explained as follows:

To reduce the amount of calculation, the relevant information of the microblogs is first clustered using the keywords. SVM, k-aggregate clustering, or other algorithms can be used for the clustering; which clustering method is used is not specifically limited in this embodiment.

Given the microblog relevant-information objects {x_1, x_2, ..., x_n} and a fixed integer k, the clustering problem for the relevant information of the microblogs can be converted into minimizing the following cost function:

J(X, c_1, c_2, ..., c_k) = Σ_{i=1}^{n} min_{t=1,...,k} d(x_i, c_t);

where d(x_i, c_t) is the distance from relevant information x_i to cluster centre c_t. By introducing an aggregate function that approximates the max-value function, the following k-aggregate clustering algorithm is obtained:

Given parameters τ ∈ (0, 1), γ > 0, and k initial cluster centres c_1^0, c_2^0, ..., c_k^0;

Step 1: for L = 1, 2, ..., k and i = 1, 2, ..., n, calculate

P(x_i, c_L^h) = exp(-||x_i - c_L^h||^2 / τ) / Σ_{j=1}^{k} exp(-||x_i - c_j^h||^2 / τ);

c_L^{h+1} = (Σ_{i=1}^{n} P(x_i, c_L^h) * x_i) / Σ_{i=1}^{n} P(x_i, c_L^h);

where the initial value of h is 0;

Step 2: if ||c_1^{h+1} - c_1^h||_2^2 + ||c_2^{h+1} - c_2^h||_2^2 + ... + ||c_k^{h+1} - c_k^h||_2^2 ≤ γ, stop clustering the relevant information; otherwise set h = h + 1 and τ = τ / 2, and repeat Step 1 to continue clustering the relevant information. The initial values of τ and γ can be set according to experience or experimental data, which is not specifically limited in this embodiment.

The c_L^h calculated according to the above formulas is the central relevant information of the cluster. Here ||x_i - c_L^h|| is obtained from the cosine of the angle between the keyword vectors of relevant information x_i and relevant information c_L^h; this is similar to calculation methods in the prior art and is not described in detail in the present invention.
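The two steps of the k-aggregate clustering algorithm can be sketched in plain Python as below. This is an illustrative sketch under assumptions: points are numeric keyword vectors, the distance is the squared Euclidean distance rather than the cosine-derived distance mentioned above, and `max_iter` and the subtraction of the smallest distance before exponentiation (for numerical stability, which leaves the weights unchanged) are additions not found in the description.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors (assumed distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_aggregate(points, centers, tau=0.5, gamma=1e-6, max_iter=50):
    """Soft clustering with a temperature tau that is halved every round."""
    for _ in range(max_iter):
        # Step 1a: membership weight P(x_i, c_L) of every point for every centre.
        weights = []
        for x in points:
            d = [sq_dist(x, c) for c in centers]
            d_min = min(d)
            e = [math.exp(-(di - d_min) / tau) for di in d]
            total = sum(e)
            weights.append([ei / total for ei in e])
        # Step 1b: each new centre is the weighted mean of all points.
        new_centers = []
        for l, old in enumerate(centers):
            mass = sum(w[l] for w in weights)
            if mass == 0.0:
                new_centers.append(list(old))
                continue
            new_centers.append([
                sum(w[l] * x[j] for w, x in zip(weights, points)) / mass
                for j in range(len(old))
            ])
        # Step 2: stop when the centres have moved by at most gamma in total.
        shift = sum(sq_dist(a, b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= gamma:
            break
        tau /= 2
    return centers
```

Halving `tau` each round sharpens the membership weights, so the soft assignment gradually approaches a hard nearest-centre assignment.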

For example, the calculated keywords are "Harbin", "Yang Tan", "bridge", and so on, and the corresponding keyword scores are shown in Table 1:

Table 1

Keyword	Harbin	Yang Tan	Bridge
Score	3	3	1

The following microblogs are clustered by the calculated keywords "Harbin", "Yang Tan", "bridge", and so on:

Microblog A: "It is reported that the Harbin Yang Tan bridge has collapsed, leaving three dead and five injured";

Microblog B: "The Harbin Yang Tan bridge has collapsed, what a jerry-built project";

Microblog C: "I heard that the Harbin Yang Tan bridge has collapsed, is it true?"

Computing the common subsets of the relevant information gives {"Harbin", "Yang Tan", "bridge", "Harbin Yang Tan", "Yang Tan bridge", "Harbin Yang Tan bridge"}. The score of each common subset is the sum of the scores of the keywords it matches, as shown in Table 2 below:

Table 2

"Harbin Yang Tan bridge" in the above table matches three keywords and obtains a score of 8, which is the highest. The result of readable expansion of the keyword group {"Yang Tan", "Harbin", "bridge"} is therefore "Harbin Yang Tan bridge", and "Harbin Yang Tan bridge" is the information of interest to the user.

205. Determine the semantic similarity of the information of interest to the user, so as to delete the topics that have drifted and the correspondence between those drifted topics and the relevant information.

In this embodiment, after the keyword extraction and expansion described above, the determined information of interest to the user may have undergone semantic drift, so that its meaning differs from that of the original microblog. Semantic similarity therefore needs to be determined in this step, and the topics that drifted during mining, together with their correspondence to the relevant information, are removed.

Optionally, determining the semantic similarity of the information of interest to the designated user, so as to delete the drifted topics and their correspondence, comprises: cutting the information of interest to the designated user in order to obtain multiple subsets of that information; matching each of these subsets against the relevant information from which the information of interest was derived; keeping the subsets that are contained in that relevant information and deleting the subsets that are not. For example, a topic ABCDEFGHIJ has a length of 10 Chinese characters. After cutting, one subset of length 10 is obtained, {ABCDEFGHIJ}; two subsets of length 9, {{ABCDEFGHI}, {BCDEFGHIJ}}; three subsets of length 8, {{ABCDEFGH}, {BCDEFGHI}, {CDEFGHIJ}}; and so on. To simplify the calculation, only some of the larger subsets may be taken. If one of these subsets is contained in the original text of a microblog, the microblog content is considered to correspond to the topic.
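The ordered cutting and containment check can be sketched in Python as follows. The function names and the `min_len` parameter, which limits the cutting to the larger subsets, are assumptions for illustration.

```python
def ordered_subsets(topic, min_len):
    """Yield contiguous substrings of the topic, longest first, down to min_len."""
    n = len(topic)
    for length in range(n, min_len - 1, -1):
        for start in range(n - length + 1):
            yield topic[start:start + length]

def topic_matches_post(topic, post, min_len):
    """A topic corresponds to a post if any sufficiently long subset occurs in it."""
    return any(subset in post for subset in ordered_subsets(topic, min_len))
```

With the topic ABCDEFGHIJ and `min_len=8`, the generator yields exactly the six subsets listed in the example above, in the same order.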

Optionally, another way of determining the semantic similarity of the information of interest to the designated user, so as to delete the drifted topics and their correspondence, comprises: splitting the information of interest to the designated user into an ordered word sequence; using the word sequence as the description vector of that information; and calculating the similarity between this description vector and the relevant information from which the information of interest was derived, so as to delete the drifted topics and their correspondence.
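One possible reading of this similarity calculation is sketched below in Python. The description does not name the similarity measure; cosine similarity over word counts is an assumption chosen here for illustration, and it ignores word order. The `threshold` value and function names are likewise hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity between two word sequences treated as count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def keep_topic(topic_words, post_words, threshold=0.5):
    """Keep the topic-to-post correspondence only if the two are similar enough."""
    return cosine_similarity(topic_words, post_words) >= threshold
```

A topic whose words barely overlap with the source post falls below the threshold and its correspondence is dropped as drifted.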

It is worth noting that step 205 is an optional step that optimizes the quality of the extracted information the designated user pays attention to. It may be omitted in a practical implementation, and this embodiment places no specific limitation on this.

206. Output the obtained information the designated user pays attention to.

In this embodiment, optionally, after the information the designated user pays attention to is obtained, the information may further be sorted from high to low according to the concern values of the keywords it contains, and the sorted information is then output. Specifically, a score for each item of information may be determined from the concern values of the keywords it contains: the concern values of the keywords contained in the item are summed to obtain its score, and the items are then sorted by score.
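The scoring and sorting just described can be sketched in a few lines of Python. The data layout (a dictionary mapping each item of information to the keywords it contains, and a dictionary of concern values) and the function names are assumptions for illustration.

```python
def score_item(item_keywords: list[str], concern_values: dict[str, float]) -> float:
    """Score of one item: the sum of the concern values of its keywords."""
    return sum(concern_values.get(keyword, 0.0) for keyword in item_keywords)


def rank_items(items: dict[str, list[str]],
               concern_values: dict[str, float]) -> list[str]:
    """Return the item names sorted by score, from high to low."""
    return sorted(items,
                  key=lambda name: score_item(items[name], concern_values),
                  reverse=True)
```

For example, with concern values `{"a": 0.9, "b": 0.5, "c": 0.2}`, an item containing keywords `a` and `b` ranks ahead of an item containing only `b`, which in turn ranks ahead of an item containing only `c`.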

When the information the designated user pays attention to is output in this embodiment, the correspondence between the hot topics the celebrity pays attention to and the celebrity's microblog posts may also be output at the same time.

The beneficial effect of this embodiment is as follows: the relevant information of the designated user in the current time period is obtained and preprocessed to obtain the related terms of the relevant information, where the relevant information comprises relevant information published or forwarded by the designated user; the concern values of the related terms are determined; the keywords in the relevant information are obtained according to the concern values of the related terms; and the information the designated user pays attention to is obtained according to the obtained keywords. The information the designated user pays attention to in celebrity microblogs can thus be mined quickly. Furthermore, the information can be sorted and output according to its popularity.

Embodiment 3

Referring to Fig. 3, an embodiment of the present invention provides a device for obtaining information. The device comprises a preprocessing module 301, a determination module 302, a first acquisition module 303 and a second acquisition module 304.

The preprocessing module 301 is configured to obtain relevant information of a designated user in the current time period and preprocess the relevant information to obtain the related terms of the relevant information, where the relevant information comprises relevant information published or forwarded by the designated user.

The determination module 302 is configured to determine the concern values of the related terms of the relevant information.

The first acquisition module 303 is configured to obtain the keywords in the relevant information according to the concern values of the related terms.

The second acquisition module 304 is configured to obtain, according to the obtained keywords, the information the designated user pays attention to.

Referring to Fig. 4, the preprocessing module 301 comprises:

a deletion unit 301a, configured to preprocess the relevant information so as to remove punctuation marks, invisible characters and characters displayed as garbled text from the relevant information;

a word segmentation unit 301b, configured to perform word segmentation on the preprocessed relevant information; and

a filtering unit 301c, configured to match the segmented relevant information against a preset vocabulary and filter out specified words from the segmented relevant information, to obtain the related terms of the relevant information.
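The three units of the preprocessing module can be sketched as a small Python pipeline. This is an illustration under stated assumptions, not the implementation of the embodiment: whitespace splitting stands in for a real Chinese word segmenter, `PRESET_VOCABULARY` is a hypothetical list of words to filter out, and the Unicode replacement character U+FFFD is taken as the example of a garbled character.

```python
import unicodedata

# Hypothetical preset vocabulary of specified words to filter out.
PRESET_VOCABULARY = {"the", "a", "of"}


def delete_noise(text: str) -> str:
    """Deletion unit: drop invisible and garbled characters and
    replace punctuation with spaces."""
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(" ")
            continue
        category = unicodedata.category(ch)
        if category.startswith("P"):        # punctuation
            kept.append(" ")
        elif category.startswith("C") or ch == "\ufffd":
            continue                        # invisible or garbled character
        else:
            kept.append(ch)
    return "".join(kept)


def segment(text: str) -> list[str]:
    """Segmentation unit: whitespace split as a stand-in for a real segmenter."""
    return text.split()


def related_terms(text: str) -> list[str]:
    """Filtering unit: remove the specified words after segmentation."""
    words = segment(delete_noise(text))
    return [w for w in words if w.lower() not in PRESET_VOCABULARY]
```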

Referring to Fig. 4, the determination module 302 comprises:

a determining unit 302a, configured to determine the frequency of occurrence of the related term in the current time period and the historical frequency of occurrence of the related term, respectively;

a first acquiring unit 302b, configured to obtain the relative change rate of the related term according to its frequency of occurrence in the current time period and its historical frequency of occurrence; and

a second acquiring unit 302c, configured to obtain the concern value of the related term according to its relative change rate.

The determining unit 302a comprises:

a determining subunit, configured to determine a first frequency of occurrence, a second frequency of occurrence and a third frequency of occurrence of the related term, respectively, and to obtain the historical frequency of occurrence of the related term from these three frequencies. The first frequency of occurrence is the frequency with which the related term occurred, within a preset time period before the current time period, in the same hour interval as the current time period. The second frequency of occurrence is the frequency with which the related term occurred, within the preset time period, on dates that differ from the current time period by seven days and in the same hour interval as the current time period. The third frequency of occurrence is the frequency with which the related term occurred over the whole day within the preset time period.
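This passage does not state how the three frequencies are combined or how the relative change rate is computed. Purely as an illustration, the following Python sketch assumes a weighted average for the historical frequency and a ratio of the difference to the historical value for the relative change rate. The weights, the smoothing constant `eps`, and the function names are all hypothetical, and the three frequencies are assumed to be expressed in the same unit.

```python
def historical_frequency(first: float, second: float, third: float,
                         weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Combine the three frequencies of occurrence into one historical value.
    The weighted average and its weights are assumptions."""
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third


def relative_change_rate(current: float, historical: float,
                         eps: float = 1e-9) -> float:
    """Assumed definition: change relative to the historical frequency.
    `eps` avoids division by zero for terms with no history."""
    return (current - historical) / (historical + eps)
```

Under these assumptions, a term whose current frequency is twice its historical frequency has a relative change rate of about 1.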

Optionally, the first acquiring unit 302b comprises:

a first obtaining subunit, configured to obtain, according to the relative change rate of the related term, the historical frequency with which the related term was not a keyword and the historical frequency with which the related term was a keyword; and to obtain the concern value of the related term according to its relative change rate and these two historical frequencies.

Optionally, the first acquiring unit 302b comprises:

a second obtaining subunit, configured to perform binary discrimination on the relative change rate of the related term using the neuron nonlinear activation function sigmoid, to obtain the concern value of the related term.
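The sigmoid function maps any real number into the interval (0, 1), which is what makes it suitable for turning a relative change rate into a bounded concern value. A minimal Python sketch follows; the `slope` and `offset` parameters and the function names are assumptions for illustration and do not appear in the text.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def concern_value(relative_change_rate: float,
                  slope: float = 1.0, offset: float = 0.0) -> float:
    """Map a relative change rate to a concern value in (0, 1).
    `slope` and `offset` are hypothetical tuning parameters."""
    return sigmoid(slope * (relative_change_rate - offset))
```

With the default parameters, a relative change rate of 0 gives a concern value of 0.5, large positive rates approach 1, and large negative rates approach 0.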

The first acquisition module 303 is specifically configured to:

compare the concern value of each related term with a first preset threshold, and take each related term whose concern value is greater than the first preset threshold as a keyword.
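The threshold comparison can be sketched as a one-line filter in Python. The function name `select_keywords` and the dictionary layout are assumptions for illustration; the strict "greater than" comparison follows the text.

```python
def select_keywords(concern_values: dict[str, float], threshold: float) -> set[str]:
    """Keep the related terms whose concern value is strictly greater
    than the first preset threshold."""
    return {term for term, value in concern_values.items() if value > threshold}
```

A term whose concern value equals the threshold exactly is not selected.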

Optionally, the second acquisition module 304 comprises:

a matching unit 304a, configured to match the keywords against preset topics and find the keywords that match a preset topic;

a first determining unit 304b, configured to determine, according to the concern values of the keywords that match the preset topic, the score of those keywords and of the preset topic; and

a first acquiring unit 304c, configured to obtain the information the designated user pays attention to according to the score of the keywords that match the preset topic and of the preset topic.
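As a rough sketch of the matching and scoring units above, the following Python code treats a keyword as matching a preset topic when it occurs in the topic text, and scores each matched topic by summing the concern values of its matching keywords. The substring criterion, the summation, and the function name are assumptions for illustration.

```python
def score_preset_topics(keywords: dict[str, float],
                        preset_topics: list[str]) -> dict[str, float]:
    """Return a score for every preset topic matched by at least one keyword.
    `keywords` maps each keyword to its concern value."""
    scores = {}
    for topic in preset_topics:
        matched = [k for k in keywords if k in topic]  # assumed match rule
        if matched:
            scores[topic] = sum(keywords[k] for k in matched)
    return scores
```

Topics with no matching keyword are left out of the result, so only matched topics are candidates for the information the designated user pays attention to.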

Optionally, the second acquisition module 304 comprises:

a clustering unit 304a', configured to cluster the relevant information of the designated user's microblog posts according to the keywords, grouping posts with high keyword similarity into the same class;

a second determining unit 304b', configured to determine a common subset of the relevant information of the clustered posts, where the common subset contains the keyword and the length of the common subset is less than or equal to a second preset threshold;

a third determining unit 304c', configured to determine the score of the common subset according to the concern value of the keyword; and

a second acquiring unit 304d', configured to obtain the information the designated user pays attention to according to the score of the common subset.
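The clustering and common-subset steps can be illustrated with the following Python sketch. It makes two simplifying assumptions that are not taken from the text: posts are grouped simply by the keyword they share (a stand-in for clustering by keyword similarity), and the common subset is taken to be the longest contiguous substring that contains the keyword, is no longer than `max_len` (the second preset threshold), and appears in every post of the cluster. All names are hypothetical.

```python
def cluster_by_keyword(posts: list[str], keywords: list[str]) -> dict[str, list[str]]:
    """Group posts under each keyword they contain (simplified clustering)."""
    clusters: dict[str, list[str]] = {}
    for post in posts:
        for keyword in keywords:
            if keyword in post:
                clusters.setdefault(keyword, []).append(post)
    return clusters


def common_subset(posts: list[str], keyword: str, max_len: int):
    """Longest substring of at most `max_len` characters that contains
    `keyword` and occurs in every post; None if there is none."""
    if not posts:
        return None
    first = posts[0]
    for length in range(min(max_len, len(first)), len(keyword) - 1, -1):
        for start in range(len(first) - length + 1):
            candidate = first[start:start + length]
            if keyword in candidate and all(candidate in p for p in posts):
                return candidate
    return None
```

The score of the resulting common subset could then be derived from the concern value of the keyword it contains, as the third determining unit describes.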

Optionally, referring to Fig. 4, the device further comprises:

an output module 305, configured to, after the second acquisition module 304 obtains the information the designated user pays attention to, sort the information from high to low according to the concern values of the keywords it contains and output the sorted information.

Optionally, referring to Fig. 4, the device further comprises:

a semantic rollback module 306, configured to, before the output module 305 sorts the information, determine the semantic similarity of the information so as to delete drifted topics in the information and the correspondence between the drifted topics and the relevant information.

Optionally, the semantic rollback module 306 comprises:

a first processing unit 306a, configured to truncate the information sequentially to obtain multiple subsets of the information, match each subset against the relevant information in which the information occurs, retain the subsets that are contained in that relevant information, and delete the subsets that are not; or

a second processing unit 306b, configured to split the information into an ordered word sequence, use the word sequence as the description vector of the information, and calculate its similarity with the relevant information in which the information occurs, so as to delete drifted topics and their correspondence from the information.

The beneficial effect of this embodiment is as follows: the relevant information of the designated user in the current time period is obtained and preprocessed to obtain the related terms of the relevant information, where the relevant information comprises relevant information published or forwarded by the designated user; the concern values of the related terms are determined; the keywords in the relevant information are obtained according to the concern values of the related terms; and the information the designated user pays attention to is obtained according to the obtained keywords. The information the designated user pays attention to in celebrity microblogs can thus be mined quickly.

It should be noted that the device for obtaining information provided in the above embodiment is described using the above division into functional modules only as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above.

In addition, the device for obtaining information provided in the above embodiment and the method embodiments for obtaining information belong to the same concept. For the specific implementation process, see the method embodiments; the details are not repeated here.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a machine-readable storage medium, which may be a read-only memory (ROM), a magnetic disk, an optical disc, or the like.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for obtaining information, characterized in that the method comprises:

2. The method according to claim 1, characterized in that preprocessing the relevant information to obtain the related terms of the relevant information comprises:

preprocessing the relevant information to remove punctuation marks, invisible characters and characters displayed as garbled text from the relevant information;

performing word segmentation on the preprocessed relevant information; and

matching the segmented relevant information against a preset vocabulary and filtering out specified words from the segmented relevant information, to obtain the related terms of the relevant information.

3. The method according to claim 1, characterized in that determining the concern values of the related terms of the relevant information comprises:

determining the frequency of occurrence of the related term in the current time period and the historical frequency of occurrence of the related term, respectively;

obtaining the relative change rate of the related term according to its frequency of occurrence in the current time period and its historical frequency of occurrence; and

obtaining the concern value of the related term according to its relative change rate.

4. The method according to claim 3, characterized in that determining the historical frequency of occurrence of the related term comprises:

determining a first frequency of occurrence, a second frequency of occurrence and a third frequency of occurrence of the related term, respectively, where the first frequency of occurrence is the frequency with which the related term occurred, within a preset time period before the current time period, in the same hour interval as the current time period; the second frequency of occurrence is the frequency with which the related term occurred, within the preset time period, on dates that differ from the current time period by seven days and in the same hour interval as the current time period; and the third frequency of occurrence is the frequency with which the related term occurred over the whole day within the preset time period; and

obtaining the historical frequency of occurrence of the related term according to the first frequency of occurrence, the second frequency of occurrence and the third frequency of occurrence.

5. The method according to claim 3, characterized in that obtaining the concern value of the related term according to its relative change rate comprises:

6. The method according to claim 3, characterized in that obtaining the concern value of the related term according to its relative change rate comprises:

performing binary discrimination on the relative change rate of the related term using the neuron nonlinear activation function sigmoid, to obtain the concern value of the related term.

7. The method according to claim 1, characterized in that obtaining the keywords in the relevant information according to the concern values of the related terms comprises:

8. The method according to claim 1, characterized in that obtaining, according to the obtained keywords, the information the designated user pays attention to comprises:

determining, according to the concern values of the keywords that match the preset topic, the score of those keywords and of the preset topic;

9. The method according to claim 1, characterized in that obtaining, according to the obtained keywords, the information the designated user pays attention to comprises:

clustering the relevant information of the designated user's microblog posts according to the keywords, grouping relevant information with high keyword similarity into the same class;

10. The method according to claim 1, characterized in that, after obtaining, according to the obtained keywords, the information the designated user pays attention to, the method further comprises:

sorting the information the designated user pays attention to from high to low according to the concern values of the keywords it contains, and outputting the sorted information.

11. The method according to claim 10, characterized in that, before the information the designated user pays attention to is sorted from high to low according to the concern values of the keywords it contains, the method further comprises:

determining the semantic similarity of the information the designated user pays attention to, so as to delete drifted topics in the information and the correspondence between the drifted topics and the relevant information.

12. The method according to claim 11, characterized in that determining the semantic similarity of the information the designated user pays attention to, so as to delete drifted topics and their correspondence from the information, comprises:

truncating the information the designated user pays attention to sequentially, to obtain multiple subsets of the information; and

matching each subset against the relevant information in which the information occurs, retaining the subsets that are contained in that relevant information and deleting the subsets that are not; or

splitting the information the designated user pays attention to into an ordered word sequence, using the word sequence as the description vector of the information, and calculating its similarity with the relevant information in which the information occurs, so as to delete drifted topics and their correspondence from the information.

13. A device for obtaining information, characterized in that the device comprises:

a second acquisition module, configured to obtain, according to the obtained keywords, the information the designated user pays attention to.

14. The device according to claim 13, characterized in that the preprocessing module comprises:

a deletion unit, configured to preprocess the relevant information so as to remove punctuation marks, invisible characters and characters displayed as garbled text from the relevant information;

a word segmentation unit, configured to perform word segmentation on the preprocessed relevant information; and

a filtering unit, configured to match the segmented relevant information against a preset vocabulary and filter out specified words from the segmented relevant information, to obtain the related terms of the relevant information.

15. The device according to claim 13, characterized in that the determination module comprises:

a determining unit, configured to determine the frequency of occurrence of the related term in the current time period and the historical frequency of occurrence of the related term, respectively;

a first acquiring unit, configured to obtain the relative change rate of the related term according to its frequency of occurrence in the current time period and its historical frequency of occurrence; and

a second acquiring unit, configured to obtain the concern value of the related term according to its relative change rate.

16. The device according to claim 15, characterized in that the determining unit comprises:

17. The device according to claim 15, characterized in that the first acquiring unit comprises:

18. The device according to claim 15, characterized in that the first acquiring unit comprises:

19. The device according to claim 13, characterized in that the first acquisition module is specifically configured to:

20. The device according to claim 13, characterized in that the second acquisition module comprises:

a matching unit, configured to match the keywords against preset topics and find the keywords that match a preset topic;

a first determining unit, configured to determine, according to the concern values of the keywords that match the preset topic, the score of those keywords and of the preset topic; and

a first acquiring unit, configured to obtain the information the designated user pays attention to according to the score of the keywords that match the preset topic and of the preset topic.

21. The device according to claim 13, characterized in that the second acquisition module comprises:

a clustering unit, configured to cluster the relevant information of the designated user's microblog posts according to the keywords, grouping posts with high keyword similarity into the same class;

a second determining unit, configured to determine a common subset of the relevant information of the clustered posts, where the common subset contains the keyword and the length of the common subset is less than or equal to a second preset threshold;

a third determining unit, configured to determine the score of the common subset according to the concern value of the keyword; and

a second acquiring unit, configured to obtain the information the designated user pays attention to according to the score of the common subset.

22. The device according to claim 13, characterized in that the device further comprises:

an output module, configured to, after the second acquisition module obtains the information the designated user pays attention to, sort the information from high to low according to the concern values of the keywords it contains and output the sorted information.

23. The device according to claim 22, characterized in that the device further comprises:

a semantic rollback module, configured to, before the output module sorts the information the designated user pays attention to, determine the semantic similarity of the information so as to delete drifted topics in the information and the correspondence between the drifted topics and the relevant information.

24. The device according to claim 23, characterized in that the semantic rollback module comprises:

a first processing unit, configured to truncate the information the designated user pays attention to sequentially to obtain multiple subsets of the information, match each subset against the relevant information in which the information occurs, retain the subsets that are contained in that relevant information, and delete the subsets that are not; or

a second processing unit, configured to split the information the designated user pays attention to into an ordered word sequence, use the word sequence as the description vector of the information, and calculate its similarity with the relevant information in which the information occurs, so as to delete drifted topics and their correspondence from the information.