CN103514167B

CN103514167B - Data processing method and equipment

Info

Publication number: CN103514167B
Application number: CN201210202800.2A
Authority: CN
Inventors: 张波; 孟遥; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-06-15
Filing date: 2012-06-15
Publication date: 2017-03-01
Anticipated expiration: 2032-06-15
Also published as: CN103514167A

Abstract

The invention discloses a data processing method and equipment. The method may include: an active time interval determination step of determining microblog user groups having similar activity habits, and determining the active time intervals of each microblog user group based on the microblogs posted by focus users in the determined microblog user groups; a keyword extraction step of extracting keywords from all microblogs within the determined active time intervals; and a topic determination step of determining, based on the extracted keywords, the corresponding topics within the determined active time intervals. According to the invention, the topics that a specific microblog user group cares about in different active time intervals can be mined, so that information can be published and obtained in a targeted manner, which substantially improves the efficiency of information processing.

Description

Data processing method and equipment

Technical field

The present invention relates to a data processing method and equipment, and more particularly to a microblog-based data processing method and equipment capable of mining the topics that different user groups care about within specific time intervals.

Background art

In recent years, with the development of Internet technology, microblogs (micro-blogs) have increasingly become one of the important ways in which people communicate. How to mine the required information from massive and disorderly network data, so as to process data more efficiently, poses a new challenge to Internet technology.

For example, for ordinary office workers on working days, the intervals in which they are active on microblogs may be concentrated between 8:30 and 9:30 in the morning and between 1:00 and 2:00 in the afternoon (that is, the period just before starting work), and between 8:30 and 10:30 in the evening (that is, leisure time after dinner), whereas at weekends their active time intervals may differ greatly from those on working days. Accordingly, there is a need for a technique that can determine the topics that different user groups care about in different active time intervals, so that information can be published and obtained in a targeted manner, thereby greatly improving data processing efficiency.

Summary of the invention

A brief summary of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor is it intended to limit the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description given later.

Therefore, in view of the above circumstances, an object of the present invention is to provide a data processing method and equipment that can determine, for the different active time intervals of a specific microblog user group, the topics that each user group cares about within those intervals, so that users can publish information in a targeted manner and efficiently obtain the information they need.

To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a data processing method, including: an active time interval determination step of determining microblog user groups having similar activity habits, and determining the active time intervals of each microblog user group based on the microblogs posted by focus users in the determined microblog user groups; a keyword extraction step of extracting keywords from all microblogs within the determined active time intervals; and a topic determination step of determining, based on the extracted keywords, the corresponding topics within the determined active time intervals.

According to a preferred embodiment of the present invention, in the active time interval determination step, determining the microblog user groups having similar activity habits may further include: a user vector construction sub-step of constructing user vectors of a predetermined dimension according to the times at which, and the quantities in which, microblog users have posted microblogs in the past; an edge determination sub-step of determining the edges between user nodes based on the similarity between the user vectors; a microblog user group construction sub-step of constructing, based on the determined edges, microblog user groups having similar activity habits; and a focus user determination sub-step of determining the authority of each microblog user based on one or more of the number of followers of the microblog user, the number of microblogs posted, the number of replies to the microblogs posted by the microblog user, and the number of forwards of the microblogs posted by the microblog user, so as to select, based on the authority, a predetermined number of microblog users from each microblog user group as focus users.

According to another preferred embodiment of the present invention, in the active time interval determination step, determining the active time intervals of each microblog user group based on the microblogs posted by the focus users in the determined microblog user groups may further include: a microblog count statistics sub-step of counting the number of microblogs posted by the focus users within each time period of a predetermined cycle, thereby obtaining a time-dependent microblog count sequence; a sequence recursive segmentation sub-step of recursively segmenting the counted microblog count sequence, thereby obtaining one or more split points; and an active time interval selection sub-step of selecting, from among the time intervals determined by the obtained split points, the top N time intervals with the largest standard variance as the active time intervals, where N is greater than or equal to 1. In the sequence recursive segmentation sub-step, the following formulas are calculated for each point in the current sequence:

AnthorV(i)=|L1(i)|*Var(L1(i))/|L|+|L2(i)|*Var(L2(i))/|L|

DiffV(i)=Var(L(i))-AnthorV(i)

where |L1(i)| and |L2(i)| denote the lengths of the two subsequences obtained by splitting the current sequence with i assumed to be the current split point, |L| denotes the length of the current sequence, and Var() denotes the standard variance of the current sequence or of a subsequence;

the point in the current sequence at which DiffV(i) is largest is found; and

if DiffV(i) at this point is less than a predetermined threshold, the recursive segmentation stops; otherwise this point is taken as the split point of the current sequence, the current sequence is divided into two subsequences, and recursive segmentation continues on each of the two subsequences.

According to yet another preferred embodiment of the present invention, the topic determination step may further include: a candidate keyword list determination sub-step of calculating, for the determined active time interval, the weight of each extracted keyword, and including the keywords whose weight exceeds a predetermined threshold in the candidate keyword list of the active time interval; a keyword relevance calculation sub-step of calculating the relevance between any two keywords in the determined candidate keyword list; a graph construction sub-step of constructing a graph with each keyword in the candidate keyword list as a node and with each calculated relevance that exceeds a predetermined threshold as an edge between keywords; and a topic determination sub-step of determining, based on the constructed graph and using a clustering algorithm, the corresponding topics within the determined active time interval.

According to a further embodiment of the present invention, in the candidate keyword list determination sub-step, the weight of each keyword may be calculated for the active time interval according to the following formula:

W(k)=count(k)*log(Q/counttimes(k))*log(authorfollowers(k))

where count(k) denotes the number of occurrences of keyword k, Q denotes the number of microblogs within the active time interval, counttimes(k) denotes the number of microblogs containing keyword k, and authorfollowers(k) denotes the total number of followers of the people who posted microblogs containing keyword k.

According to another aspect of an embodiment of the present invention, there is also provided a data processing equipment, including: an active time interval determination unit configured to determine microblog user groups having similar activity habits, and to determine the active time intervals of each microblog user group based on the microblogs posted by focus users in the determined microblog user groups; a keyword extraction unit configured to extract keywords from all microblogs within the determined active time intervals; and a topic determination unit configured to determine, based on the extracted keywords, the corresponding topics within the determined active time intervals.

In addition, according to another aspect of an embodiment of the present invention, there is also provided a terminal device that includes the above data processing equipment. Examples of the terminal device include a mobile phone, a palmtop computer, a tablet computer, a PC, and so on.

In addition, according to another aspect of an embodiment of the present invention, there is also provided a storage medium containing machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the data processing method according to the present invention.

Furthermore, according to yet another aspect of an embodiment of the present invention, there is also provided a program product containing machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the data processing method according to the present invention.

Therefore, according to the embodiments of the present invention, topics can be published and information obtained in a targeted manner, so that microblogs can be better used to obtain information and the efficiency of data processing is substantially improved.

Other aspects of the embodiments of the present invention are given in the following parts of the description, in which the detailed description serves to fully disclose the preferred embodiments of the present invention without imposing limitations on them.

Brief description of the drawings

The present invention can be better understood by referring to the detailed description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote the same or similar parts. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate the preferred embodiments of the present invention and to explain its principles and advantages. In the drawings:

Fig. 1 is a flow chart illustrating a data processing method according to an embodiment of the present invention;

Fig. 2 is a flow chart illustrating the detailed process of determining the microblog user groups having similar activity habits in the active time interval determination step shown in Fig. 1;

Fig. 3 is a flow chart illustrating the detailed process of determining the active time intervals based on the microblogs posted by the focus users in the active time interval determination step shown in Fig. 1;

Fig. 4 is a schematic diagram illustrating microblog count statistics;

Fig. 5 is a flow chart illustrating the detailed process of the topic determination step shown in Fig. 1;

Fig. 6 is a schematic diagram illustrating a topic clustering result;

Fig. 7 is a block diagram illustrating the functional configuration of a data processing equipment according to an embodiment of the present invention;

Fig. 8 is a block diagram illustrating an example of the detailed functional configuration of the active time interval determination unit shown in Fig. 7;

Fig. 9 is a block diagram illustrating another example of the detailed functional configuration of the active time interval determination unit shown in Fig. 7;

Fig. 10 is a block diagram illustrating the detailed functional configuration of the topic determination unit shown in Fig. 7; and

Fig. 11 is a block diagram illustrating an exemplary structure of a personal computer serving as the information processing device employed in the embodiments of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present invention will be described below with reference to the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that many implementation-specific decisions must be made in developing any such actual embodiment in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should also be appreciated that although such development work may be very complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.

It should also be noted here that, in order to avoid obscuring the present invention with unnecessary detail, only the device structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details of little relevance to the present invention are omitted.

The data processing method and equipment according to embodiments of the present invention will be described below with reference to Figs. 1 to 10.

First, the data processing method according to an embodiment of the present invention will be described with reference to Fig. 1. As shown in Fig. 1, the data processing method may include an active time interval determination step S101, a keyword extraction step S102, and a topic determination step S103.

Specifically, in the active time interval determination step S101, microblog user groups having similar activity habits may be determined, and the active time intervals of each microblog user group may be determined based on the microblogs posted by the focus users in the determined microblog user groups.

As described above, different crowds are active in different time periods. For example, ordinary office workers, students, and retired elderly people have different daily schedules and therefore markedly different active time intervals. It is therefore necessary first to identify, among the vast number of microblog users, the microblog user groups having similar activity habits, so that, based on the microblogs posted by the focus users in each microblog user group, topics or information of interest to a particular group can be published to it. The process of determining the microblog user groups having similar activity habits is described in detail below with reference to Fig. 2.

As shown in Fig. 2, in the active time interval determination step S101 shown in Fig. 1, determining the microblog user groups having similar activity habits may further include a user vector construction sub-step S201, an edge determination sub-step S202, a microblog user group construction sub-step S203, and a focus user determination sub-step S204.

Specifically, first, in the user vector construction sub-step S201, user vectors of a predetermined dimension may be constructed according to the times at which, and the quantities in which, microblog users have posted microblogs in the past. As an example, a 24-dimensional user vector may be constructed using one hour as the unit and one day as the statistical interval. Each user vector can then be expressed as V = (n1, n2, ..., n24), where ni denotes the number of microblogs posted by the microblog user within the i-th time period. It should be understood that although 24-dimensional user vectors are constructed here in units of hours, this is only an example and not a limitation, and user vectors of more or fewer dimensions may be constructed as required.
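The construction of a 24-dimensional user vector can be sketched as follows. This is a minimal illustration only: the function name `build_user_vector` and the representation of posting times as `datetime` objects are assumptions made for the example and do not come from the patent text.

```python
from datetime import datetime

def build_user_vector(post_times):
    """Return a 24-dimensional vector; entry h counts the posts made in hour h.

    `post_times` is assumed to be a list of datetime objects, one per
    microblog the user has posted in the past (hypothetical input format).
    """
    vector = [0] * 24
    for t in post_times:
        vector[t.hour] += 1
    return vector

# Example: two posts between 8:00 and 9:00, one post between 21:00 and 22:00.
posts = [
    datetime(2012, 6, 15, 8, 40),
    datetime(2012, 6, 15, 8, 55),
    datetime(2012, 6, 16, 21, 10),
]
v = build_user_vector(posts)
```

A finer or coarser vector could be obtained in the same way by changing the bucket size, in line with the remark that 24 dimensions is only an example.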

Next, in the edge determination sub-step S202, the edges between user nodes may be determined based on the similarity between the user vectors constructed in the user vector construction sub-step S201.

Preferably, in the edge determination sub-step S202, the similarity between the user vectors may be determined using a method based on the cosine of the angle between vectors, and an edge between two user nodes whose determined similarity is greater than a predetermined threshold is determined to be a formal edge.

Specifically, for example, for any two user vectors V1 = (n1, n2, ..., n24) and V2 = (p1, p2, ..., p24), the similarity between V1 and V2 may be calculated by the following formula (1):

CosVal=(n1*p1+n2*p2+…+n24*p24)/(sqrt(n1*n1+n2*n2+…+n24*n24)*sqrt(p1*p1+p2*p2+…+p24*p24))（1）

where sqrt denotes the square root operation and CosVal denotes the similarity between the users. Preferably, if CosVal > m, the edge between the two users is determined to be a formal edge, where m is a predetermined threshold.
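The cosine similarity of formula (1) and the threshold test for formal edges can be sketched as follows. The function names, the dictionary input format, and the handling of all-zero vectors are assumptions made for this illustration; the patent text does not specify them.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two user vectors, as in formula (1)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: a user with no posts is treated as similar to nobody.
        return 0.0
    return dot / (norm1 * norm2)

def formal_edges(vectors, m):
    """Return the user pairs whose similarity exceeds the threshold m.

    `vectors` is assumed to map a user id to that user's vector.
    """
    users = sorted(vectors)
    edges = []
    for i, u in enumerate(users):
        for w in users[i + 1:]:
            if cosine_similarity(vectors[u], vectors[w]) > m:
                edges.append((u, w))
    return edges

# Short 3-dimensional vectors are used here only to keep the example small.
example = {"a": [1, 0, 2], "b": [2, 0, 4], "c": [0, 3, 0]}
edges = formal_edges(example, 0.8)
```

Because every pair of users is compared, this sketch takes time quadratic in the number of users; a large-scale implementation would likely need some form of pruning, which is outside what the text describes.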

Next, in the microblog user group construction sub-step S203, microblog user groups having similar activity habits may be constructed from the formal edges between user nodes determined in the edge determination sub-step S202, using a graph partitioning algorithm such as CNM. A group may be expressed, for example, as C = (V1, V2, ..., Vr).

Subsequently, in the focus user determination sub-step S204, the authority of each microblog user may be determined based on one or more of the number of followers of the microblog user, the number of microblogs posted, the number of replies to the microblogs posted by the microblog user, and the number of forwards of the microblogs posted by the microblog user, so that a predetermined number of microblog users can be selected from each constructed microblog user group as focus users based on the determined authority.

For example, when the number of followers a of a microblog user and the number of microblogs b posted by that user are taken as the factors to be considered, the authority of the microblog user may be calculated by the following formula (2):

Authority=log(b)*log(a)（2）

where Authority denotes the authority of the microblog user and log denotes the logarithm operation. Preferably, the users whose authority ranks in, for example, the top 50% of each microblog user group may be taken as focus users, that is, as meaningful objects of statistics. It should be understood that this method of calculating authority is merely an example and not a limitation.
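Formula (2) and the selection of the top-ranked users can be sketched as follows. The natural logarithm is used because the text does not state the base, and users with fewer than one follower or post are given an authority of zero to avoid taking the logarithm of zero; both choices, like the function names and the input format, are assumptions made for this illustration.

```python
import math

def authority(followers, posts):
    """Authority = log(b) * log(a), with a = followers and b = posts."""
    if followers < 1 or posts < 1:
        return 0.0  # assumption: avoids log(0) for inactive or unfollowed users
    return math.log(posts) * math.log(followers)

def select_focus_users(group, fraction=0.5):
    """Return the users whose authority ranks in the top `fraction` of the group.

    `group` is assumed to map a user id to a (followers, posts) pair.
    """
    ranked = sorted(group, key=lambda u: authority(*group[u]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

group = {
    "u1": (1000, 200),
    "u2": (10, 5),
    "u3": (50000, 3000),
    "u4": (2, 1),
}
focus = select_focus_users(group)
```

Note that under formula (2) a user with exactly one follower or exactly one post has an authority of zero, since log(1) = 0.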

Through the processing in steps S201 to S204 above, the microblog user groups having similar activity habits are determined, and the focus users in each microblog user group are further determined. The detailed process of determining, in the active time interval determination step S101 shown in Fig. 1, the active time intervals of each microblog user group based on the microblogs posted by the focus users in the determined microblog user groups is described below with reference to Fig. 3.

As shown in Fig. 3, in the active time interval determination step S101 shown in Fig. 1, determining the active time intervals of each microblog user group based on the microblogs posted by the focus users in the determined microblog user groups may include a microblog count statistics sub-step S301, a sequence recursive segmentation sub-step S302, and an active time interval selection sub-step S303.

First, in the microblog count statistics sub-step S301, the number of microblogs posted by the determined focus users within each time period of a predetermined cycle may be counted, thereby obtaining a time-dependent microblog count sequence. Preferably, since even the same user may keep markedly different schedules on working days and at weekends, the statistics may be compiled separately for working days and weekends, which makes them more reasonable and allows topics to be mined more accurately. Here, as an example, the time-dependent microblog count sequence is determined with one day as the predetermined cycle and one minute as the interval. Taking time as the horizontal axis, for example at one-minute intervals, and the number of posted microblogs as the vertical axis gives a statistical chart such as that shown in Fig. 4, where Fig. 4(a) is the chart for working days and Fig. 4(b) is the chart for weekends. Thus, each element of the determined microblog count sequence (that is, a microblog count) corresponds to one time period.
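The per-minute counting with separate working-day and weekend sequences can be sketched as follows. The function name, the use of `datetime` objects, and the treatment of Saturday and Sunday as the weekend are assumptions made for this illustration.

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60

def minute_count_series(post_times):
    """Return (weekday_series, weekend_series), each with one count per minute of the day.

    `post_times` is assumed to be the posting times of all microblogs
    posted by the focus users of one microblog user group.
    """
    weekday = [0] * MINUTES_PER_DAY
    weekend = [0] * MINUTES_PER_DAY
    for t in post_times:
        slot = t.hour * 60 + t.minute
        if t.weekday() >= 5:  # datetime.weekday(): 5 is Saturday, 6 is Sunday
            weekend[slot] += 1
        else:
            weekday[slot] += 1
    return weekday, weekend

posts = [
    datetime(2012, 6, 15, 8, 30),  # a Friday
    datetime(2012, 6, 15, 8, 30),
    datetime(2012, 6, 16, 21, 0),  # a Saturday
]
weekday_series, weekend_series = minute_count_series(posts)
```

Each returned list plays the role of the time-dependent microblog count sequence that the subsequent segmentation operates on.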

Next, in the sequence recursive segmentation sub-step S302, the microblog count sequence obtained in step S301 may be recursively segmented, thereby obtaining one or more split points.

Specifically, in the sequence recursive segmentation sub-step S302, the recursive segmentation is carried out as follows.

First, for each point in the current sequence, the following formulas (3) and (4) are calculated:

AnthorV(i)=|L1(i)|*Var(L1(i))/|L|+|L2(i)|*Var(L2(i))/|L|（3）

DiffV(i)=Var(L(i))-AnthorV(i)（4）

where |L1(i)| and |L2(i)| denote the lengths of the two subsequences obtained by splitting the current sequence with i assumed to be the current split point, |L| denotes the length of the current sequence, and Var() denotes the standard variance of the current sequence or of a subsequence; the smaller the variance, the more uniform the sequence.

Next, the point in the current sequence at which DiffV(i) is largest is found. If this largest DiffV(i) is less than a predetermined threshold, the recursive segmentation stops. Otherwise, the point at which DiffV(i) is largest is taken as the split point, the current sequence is divided into two subsequences, and recursive segmentation continues on each of the two subsequences in the same manner, so that one or more split points are obtained. The purpose of this series of processing is to find the intervals in which the number of microblogs posted by users surges, that is, the users' active time intervals, such as the intervals shown in Fig. 4 in which the number of posted microblogs surges.
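The recursive segmentation described by formulas (3) and (4) can be sketched as follows. The sketch interprets Var() as the population variance computed by `statistics.pvariance`; the translated text says "standard variance", so this interpretation is an assumption, as are the function name and the convention that split point i divides the sequence into its first i elements and the rest.

```python
from statistics import pvariance

def split_points(seq, threshold, offset=0):
    """Return the sorted split points of `seq` found by recursive segmentation.

    A split point i means the sequence is cut between positions i-1 and i.
    `offset` converts positions in a subsequence back to positions in the
    original sequence.
    """
    n = len(seq)
    if n < 2:
        return []
    total_var = pvariance(seq)
    best_i, best_diff = None, float("-inf")
    for i in range(1, n):
        left, right = seq[:i], seq[i:]
        # AnthorV(i): length-weighted variance of the two subsequences.
        anthor_v = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        diff_v = total_var - anthor_v  # DiffV(i)
        if diff_v > best_diff:
            best_i, best_diff = i, diff_v
    if best_diff < threshold:
        return []  # no split reduces the variance enough: stop recursing
    return (split_points(seq[:best_i], threshold, offset)
            + [offset + best_i]
            + split_points(seq[best_i:], threshold, offset + best_i))

counts = [1, 1, 1, 1, 9, 9, 9, 9, 1, 1, 1, 1]
cuts = split_points(counts, threshold=1.0)
```

For the example sequence the burst of activity in the middle is isolated by the two split points 4 and 8. Each recursion level recomputes the variances from scratch, which keeps the sketch short but is not efficient for long sequences.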

The description of the process of determining the active time intervals now continues. Specifically, in the active time interval selection sub-step S303 shown in Fig. 3, from among the time intervals determined by the split points obtained in the sequence recursive segmentation sub-step S302, the top N time intervals with the largest standard variance may be selected as the active time intervals of the microblog user group, where N is a predetermined value greater than or equal to 1.
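The selection of the top N intervals can be sketched as follows, again interpreting the variance as the population variance (an assumption) and representing each interval as a half-open pair of indices into the count sequence. The function name and input format are hypothetical.

```python
from statistics import pvariance

def top_intervals(seq, cut_points, n):
    """Return the n intervals (start, end) of `seq` with the largest variance.

    `cut_points` are split positions inside `seq`; interval (start, end)
    covers seq[start:end].
    """
    bounds = [0] + sorted(cut_points) + [len(seq)]
    intervals = [
        (bounds[k], bounds[k + 1])
        for k in range(len(bounds) - 1)
        if bounds[k] < bounds[k + 1]  # skip empty intervals
    ]
    ranked = sorted(
        intervals,
        key=lambda iv: pvariance(seq[iv[0]:iv[1]]),
        reverse=True,
    )
    return ranked[:n]

counts = [1, 1, 1, 1, 2, 9, 3, 8, 1, 1, 2, 1]
active = top_intervals(counts, [4, 8], n=1)
```

With one-minute time periods, an interval (start, end) would correspond to the minutes of the day from `start` up to, but not including, `end`.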

After the active time intervals of a specific microblog user group have been determined through the above series of processing, it is further necessary to determine the topics that these users care about in the different active time intervals, so as to improve the efficiency of data processing and make it possible to publish and obtain information in a targeted manner. Referring back to Fig. 1, the description of the data processing method according to an embodiment of the present invention now continues.

In the keyword extraction step S102, keywords may be extracted from all microblogs within the active time intervals determined in step S101. The keyword extraction may include, for example, word segmentation and stop-word filtering. Those skilled in the art can perform this processing using any suitable keyword extraction technique known in the art, so it is not described further here.
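As a rough illustration of word segmentation followed by stop-word filtering, the sketch below splits text on non-word characters and removes words from a small stop-word set. It is only a stand-in: Chinese microblogs would require a dedicated word segmenter, and the stop-word list, function name, and regular expression here are hypothetical choices made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def extract_keywords(posts):
    """Return one keyword list per post: lower-cased tokens minus stop words."""
    keywords = []
    for text in posts:
        tokens = re.findall(r"\w+", text.lower())
        keywords.append([t for t in tokens if t not in STOP_WORDS])
    return keywords

posts = ["The air quality is bad", "Exam reform in schools"]
keywords = extract_keywords(posts)
```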

Next, in the topic determination step S103, the corresponding topics within the determined active time intervals may be determined based on the keywords extracted in step S102.

The detailed process of the topic determination step is described below with reference to Fig. 5.

As shown in Fig. 5, the topic determination step S103 may include a candidate keyword list determination sub-step S501, a keyword relevance calculation sub-step S502, a graph construction sub-step S503, and a topic determination sub-step S504.

First, in the candidate keyword list determination sub-step S501, the weight of each extracted keyword may be calculated for the determined active time interval, and the keywords whose weight exceeds a predetermined threshold are included in the candidate keyword list of that active time interval.

Specifically, for the determined active time interval, the weight of each extracted keyword may be calculated, for example, by the following formula (5):

W(k)=count(k)*log(Q/counttimes(k))*log(authorfollowers(k))（5）

where count(k) denotes the number of occurrences of keyword k, Q denotes the number of microblogs within the active time interval, counttimes(k) denotes the number of microblogs containing keyword k, and authorfollowers(k) denotes the total number of followers of the people who posted microblogs containing keyword k. The logarithm is applied to the follower count so that large fluctuations in the number of followers do not distort the result.
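Formula (5) can be sketched as follows. The input format (one tuple per microblog), the natural logarithm, the decision to count each author's followers only once per keyword, and the zero weight assigned when the follower total is below one are all assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

def keyword_weights(posts):
    """Compute W(k) = count(k) * log(Q / counttimes(k)) * log(authorfollowers(k)).

    `posts` is assumed to be a list of (author, follower_count, keyword_list)
    tuples, one per microblog in the active time interval.
    """
    q = len(posts)                      # Q: number of microblogs
    count = Counter()                   # count(k): total occurrences of k
    counttimes = Counter()              # counttimes(k): microblogs containing k
    authors = defaultdict(dict)         # k -> {author: follower_count}
    for author, followers, keywords in posts:
        count.update(keywords)
        for k in set(keywords):
            counttimes[k] += 1
            authors[k][author] = followers
    weights = {}
    for k in count:
        total_followers = sum(authors[k].values())
        if total_followers < 1:
            weights[k] = 0.0  # assumption: avoids log(0)
            continue
        weights[k] = count[k] * math.log(q / counttimes[k]) * math.log(total_followers)
    return weights

posts = [
    ("u1", 100, ["air", "pollution", "air"]),
    ("u2", 1000, ["air", "exam"]),
    ("u3", 10, ["exam"]),
    ("u4", 10, ["reform"]),
]
weights = keyword_weights(posts)
```

One consequence of the formula is that a keyword appearing in every microblog of the interval receives a weight of zero, because log(Q/counttimes(k)) = log(1) = 0. The candidate keyword list would then consist of the keywords whose weight exceeds the predetermined threshold.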

Next, in the keyword relevance calculation sub-step S502, the relevance between any two keywords in the candidate keyword list determined in step S501 may be calculated.

Specifically, as an example, the relevance between two keywords may be calculated by the following formula (6):

I(A,B)=log(P(A,B))/(log(P(A))*log(P(B)))（6）

where P(A) and P(B) denote the probability, relative to the total number of microblogs in the active time interval, of a microblog containing keyword A or keyword B, respectively, and P(A,B) denotes the probability, relative to the total number of microblogs in the active time interval, of a microblog containing both keyword A and keyword B.
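Formula (6) can be sketched as follows. The text does not say how to handle the cases in which the formula is undefined (two keywords that never co-occur, or a keyword that appears in every microblog so that its logarithm is zero); returning `None` in those cases is an assumption of this illustration, as are the function name and the input format.

```python
import math

def relevance(posts_keywords, a, b):
    """Compute I(A,B) = log(P(A,B)) / (log(P(A)) * log(P(B))).

    `posts_keywords` is assumed to be a non-empty list of keyword sets,
    one per microblog in the active time interval.
    """
    q = len(posts_keywords)
    p_a = sum(1 for ks in posts_keywords if a in ks) / q
    p_b = sum(1 for ks in posts_keywords if b in ks) / q
    p_ab = sum(1 for ks in posts_keywords if a in ks and b in ks) / q
    if p_ab == 0 or p_a == 1 or p_b == 1:
        return None  # assumption: formula undefined (log(0) or division by zero)
    return math.log(p_ab) / (math.log(p_a) * math.log(p_b))

posts = [
    {"air", "pollution"},
    {"air", "pollution"},
    {"air"},
    {"exam"},
]
score = relevance(posts, "air", "pollution")
```

Since all three probabilities are at most 1, the numerator is never positive and the denominator is positive, so the value is never positive; it moves closer to zero as the two keywords co-occur more often.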

Next, in the graph construction sub-step S503, a graph may be constructed with each keyword in the candidate keyword list determined in step S501 as a node, and with each relevance calculated in step S502 that exceeds a predetermined threshold as an edge between keywords.
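The graph construction can be sketched with a plain adjacency structure. The function name and the representation of relevance scores as a dictionary keyed by keyword pairs are assumptions made for this illustration.

```python
def build_keyword_graph(candidates, relevance_scores, threshold):
    """Return an undirected graph as a dict mapping each keyword to its neighbours.

    `candidates` is the candidate keyword list; `relevance_scores` is assumed
    to map a (keyword, keyword) pair to its relevance. An edge is added only
    when the relevance exceeds `threshold`.
    """
    graph = {k: set() for k in candidates}
    for (a, b), score in relevance_scores.items():
        if a in graph and b in graph and score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

candidates = ["air", "pollution", "exam"]
scores = {("air", "pollution"): -0.5, ("air", "exam"): -4.0}
graph = build_keyword_graph(candidates, scores, threshold=-1.0)
```

Keywords without any sufficiently relevant partner remain in the graph as isolated nodes, which a subsequent clustering step can treat as single-keyword clusters or discard.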

Then, in the topic determination sub-step S504, the corresponding topics within each active time interval may be determined using a clustering algorithm based on the graph constructed in step S503. Preferably, the CNM graph partitioning algorithm may be used here for topic clustering. The resulting topic clustering diagram is, for example, as shown in Fig. 6, in which different colors represent different topic clusters. For example, keywords such as air quality, pollution, and environmental protection form a topic related to environmental protection, while keywords such as reform, school admission, and examinations form a topic related to education.

Although the data processing method according to the embodiments of the present invention has been described in detail above with reference to Figs. 1 to 6, those skilled in the art should understand that the flow charts shown in the drawings are merely exemplary, and that the above method flow may be modified according to the practical application and specific requirements. For example, the execution order of certain steps in the above method may be adjusted as needed, or certain processing steps may be omitted or added. In addition, the methods described above for calculating keyword weights, the relevance between keywords, and so on are merely examples and not limitations, and other techniques known in the art may be used for these calculations.

Corresponding to the data processing method according to the embodiments of the present invention, an embodiment of the present invention also provides a data processing equipment.

Specifically, as shown in Fig. 7, the data processing equipment 700 may include an active time interval determination unit 701, a keyword extraction unit 702, and a topic determination unit 703. The functional configuration of each unit is described in detail below.

The active time interval determination unit 701 may be configured to determine microblog user groups having similar activity habits, and to determine the active time intervals of each microblog user group based on the microblogs posted by the focus users in the determined microblog user groups.

Preferably, as shown in Fig. 8, the active time interval determination unit 701 may further include a user vector construction subunit 801, an edge determination subunit 802, a microblog user group construction subunit 803, and a focus user determination subunit 804. The functional configuration of each subunit is described in detail below.

The user vector construction subunit 801 may be configured to construct user vectors of a predetermined dimension according to the times at which, and the quantities in which, microblog users have posted microblogs in the past. Here, as an example, a 24-dimensional user vector V = (n1, n2, ..., n24) may be constructed, where ni denotes the number of microblogs posted by the microblog user within the i-th time period.

The edge determination subunit 802 may be configured to determine the edges between user nodes based on the similarity between the user vectors. Preferably, as an example, the similarity between the user vectors is determined using a method based on the cosine of the angle between vectors, and an edge between two user nodes whose determined similarity is greater than a predetermined threshold is determined to be a formal edge.

The microblog user group construction subunit 803 may be configured to construct, based on the determined formal edges between user nodes, microblog user groups having similar activity habits using a graph partitioning algorithm such as CNM. A group may be expressed, for example, as C = (V1, V2, ..., Vr).

The focus user determination subunit 804 may be configured to determine the authority of each microblog user based on one or more of the number of followers of the microblog user, the number of microblogs posted, the number of replies to the microblogs posted by the microblog user, and the number of forwards of the microblogs posted by the microblog user, so that a predetermined number of microblog users can be selected from each constructed microblog user group as focus users based on the determined authority. As an example, the number of followers of a microblog user and the number of microblogs posted may be taken as the factors to be considered, and the users whose authority ranks in the top 50% of the microblog user group may be taken as focus users.

Preferably, as shown in figure 9, active time interval determination unit 701 can further include microblogging quantity statistics Subelement 901, sequence recursive subdivision subelement 902 and active time interval selection subelement 903.

Microblogging quantity statistics subelement 901 may be configured to statistics in predetermined period（For example, one day）Each period The quantity of the microblogging that concern user determined by interior issues, thus obtain the microblogging quantity series with time correlation.Preferably, should Statistical work can be carried out, so that statistical result is more scientific and reasonable respectively for working day and weekend.

The sequence recursive segmentation subunit 902 may be configured to recursively segment the counted microblog quantity sequence, thereby obtaining one or more split points.
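The recursive segmentation can be sketched as below, using the AnthorV(i) and DiffV(i) formulas and the stopping rule recited in the claims. Several details are assumptions: Var() is taken to be the population variance, a split at index i divides the sequence into seq[:i] and seq[i:], and the function names, sample sequence, and threshold are hypothetical.

```python
from statistics import pvariance


def best_split(seq):
    """Return (index, DiffV) of the split maximizing DiffV, or None if seq is too short."""
    n = len(seq)
    if n < 2:
        return None
    total_var = pvariance(seq)
    best = None
    for i in range(1, n):
        left, right = seq[:i], seq[i:]
        # AnthorV(i) = |L1(i)|*Var(L1(i))/|L| + |L2(i)|*Var(L2(i))/|L|
        anthor_v = len(left) * pvariance(left) / n + len(right) * pvariance(right) / n
        # DiffV(i) = Var(L) - AnthorV(i)
        diff_v = total_var - anthor_v
        if best is None or diff_v > best[1]:
            best = (i, diff_v)
    return best


def segment(seq, threshold, offset=0):
    """Recursively split seq; return the split points as indices into the full sequence."""
    found = best_split(seq)
    if found is None or found[1] < threshold:
        return []  # stop when the best DiffV falls below the threshold
    i, _ = found
    return (segment(seq[:i], threshold, offset)
            + [offset + i]
            + segment(seq[i:], threshold, offset + i))


# Hypothetical microblog counts per period.
counts = [1, 1, 1, 9, 9, 9, 1, 1]
cuts = segment(counts, threshold=1.0)
```

For this sequence the split points fall at indices 3 and 6, isolating the burst of activity in the middle.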

The active time interval selection subunit 903 may be configured to select, from the time intervals determined by the obtained split points, the N time intervals with the largest standard variance as the active time intervals of the microblog user group, where N is a predetermined value greater than or equal to 1.

Next, referring back to Fig. 7, the description of the functional configuration of each unit of the data processing device according to an embodiment of the present invention continues.

The keyword extraction unit 702 may be configured to extract keywords from all microblogs in the determined active time interval. Keyword extraction methods are known in the art and are not described here.

The topic determination unit 703 is configured to determine, based on the extracted keywords, the corresponding topics in the active time interval of each microblog user group.

Referring to Fig. 10, the topic determination unit 703 may include a candidate keyword list determination subunit 1001, a keyword relevance calculation subunit 1002, a graph construction subunit 1003, and a topic determination subunit 1004. The functional configuration of each subunit is described in detail below.

Specifically, the candidate keyword list determination subunit 1001 may be configured to calculate, for the determined active time interval, the weight of each extracted keyword, and to add the keywords whose weight exceeds a predetermined threshold to the candidate keyword list of that active time interval.

The keyword relevance calculation subunit 1002 may be configured to calculate the relevance between any two keywords in the determined candidate keyword list.

For the keyword weight calculation method and the relevance calculation method used by the candidate keyword list determination subunit 1001 and the keyword relevance calculation subunit 1002, reference may be made to the methods used in the candidate keyword list determination sub-step S501 and the keyword relevance calculation sub-step S502 above; they are not described again here.

The graph construction subunit 1003 may be configured to construct a graph in which each keyword in the determined candidate keyword list is a node and each calculated relevance that exceeds a predetermined threshold is an edge between keywords.

The topic determination subunit 1004 is configured to determine, based on the constructed graph, the corresponding topics in each active time interval using a clustering algorithm. Preferably, the clustering algorithm may be the CNM graph partitioning algorithm. The resulting topic clusters are shown, for example, in Fig. 6.
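The graph construction and grouping steps above can be sketched as follows. Note the substitution: the text prefers the CNM graph partitioning algorithm, whereas this sketch groups keywords by plain connected components, a much simpler stand-in used only to keep the example self-contained. The function names, keywords, relevance scores, and threshold are all hypothetical.

```python
def build_keyword_graph(keywords, relevance, threshold):
    """Nodes are keywords; an edge joins two keywords whose relevance exceeds the threshold."""
    graph = {k: set() for k in keywords}
    for (a, b), score in relevance.items():
        if score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes into connected components (a simplified stand-in for CNM clustering)."""
    seen, topics = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        topics.append(sorted(component))
    return topics


# Hypothetical candidate keywords and pairwise relevance scores.
keywords = ["match", "goal", "team", "movie", "actor"]
relevance = {
    ("match", "goal"): 0.9,
    ("goal", "team"): 0.7,
    ("movie", "actor"): 0.8,
    ("team", "movie"): 0.1,
}
graph = build_keyword_graph(keywords, relevance, threshold=0.5)
topics = connected_components(graph)
```

A modularity-based method such as CNM can also split a single connected component into several topics, which connected components cannot do.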

It should be noted that the device described in the embodiments of the present invention corresponds to the foregoing method embodiments. For any part of the device embodiments that is not described in detail, reference may therefore be made to the corresponding description in the method embodiments, which is not repeated here.

In addition, it should be noted that the above series of processes and the device may also be implemented by software and/or firmware. In that case, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example the general-purpose personal computer 1100 shown in Fig. 11. With the various programs installed, the computer is able to perform various functions.

In Fig. 11, a central processing unit (CPU) 1101 performs various processes according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores, as needed, the data required when the CPU 1101 performs the various processes.

The CPU 1101, the ROM 1102, and the RAM 1103 are connected to one another via a bus 1104. An input/output interface 1105 is also connected to the bus 1104.

The following components are connected to the input/output interface 1105: an input section 1106, including a keyboard, a mouse, and the like; an output section 1107, including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker and the like; a storage section 1108, including a hard disk and the like; and a communication section 1109, including a network interface card such as a LAN card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the Internet.

As needed, a drive 1110 is also connected to the input/output interface 1105. A removable medium 1111, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read from it is installed into the storage section 1108 as needed.

When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1111.

Those skilled in the art will understand that the storage medium is not limited to the removable medium 1111 shown in Fig. 11, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1111 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1102, a hard disk contained in the storage section 1108, or the like, in which the program is stored and which is distributed to the user together with the device containing it.

It should also be noted that the steps of the above series of processes may naturally be performed chronologically in the order described, but need not be. Some steps may be performed in parallel or independently of one another.

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "include", "comprise", and any other variants thereof in the embodiments of the present invention are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to that process, method, article, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

With respect to the implementations including the above embodiments, the following remarks are also disclosed:

Remark 1. A data processing method, comprising:

an active time interval determination step for determining microblog user groups with similar activity habits, and determining the active time interval of each microblog user group based on the microblogs posted by the focus users in the determined microblog user group;

a keyword extraction step for extracting keywords from all microblogs in the determined active time interval; and

a topic determination step for determining, based on the extracted keywords, the corresponding topics in the determined active time interval.

Remark 2. The data processing method according to Remark 1, wherein, in the active time interval determination step, determining the microblog user groups with similar activity habits further comprises:

a user vector building sub-step for building, according to the times and frequencies at which each microblog user has posted microblogs in the past, a user vector with a predetermined number of dimensions;

an edge determination sub-step for determining the edges between user nodes based on the similarity between the user vectors;

a microblog user group building sub-step for building, based on the determined edges, microblog user groups with similar activity habits; and

a focus user determination sub-step for determining the authority of each microblog user based on one or more of the user's number of followers, the number of microblogs posted, the number of replies to the microblogs posted by the user, and the number of times the microblogs posted by the user have been forwarded, and selecting, based on the authority, a predetermined number of microblog users from the microblog user group as the focus users.

Remark 3. The data processing method according to Remark 2, wherein, in the edge determination sub-step, the similarity between user vectors is determined using the cosine of the angle between the vectors, and an edge is formally established between two user nodes whose similarity exceeds a predetermined threshold.

Remark 4. The data processing method according to Remark 1, wherein, in the active time interval determination step, determining the active time interval of each microblog user group based on the microblogs posted by the focus users in the determined microblog user group further comprises:

a microblog quantity statistics sub-step for counting the number of microblogs posted by the focus users in each period of a predetermined cycle, thereby obtaining a time-dependent microblog quantity sequence;

a sequence recursive segmentation sub-step for recursively segmenting the counted microblog quantity sequence, thereby obtaining one or more split points; and

an active time interval selection sub-step for selecting, from the time intervals determined by the obtained split points, the N time intervals with the largest standard variance as the active time intervals, where N is greater than or equal to 1,

wherein, in the sequence recursive segmentation sub-step:

for each point in the current sequence, the following are calculated:

AnthorV(i)=|L1(i)|*Var(L1(i))/|L|+|L2(i)|*Var(L2(i))/|L|

DiffV(i)=Var(L(i))-AnthorV(i)

where |L1(i)| and |L2(i)| denote the lengths of the two subsequences obtained by splitting the current sequence at point i, |L| denotes the length of the current sequence, and Var() denotes the standard variance of the current sequence or of a subsequence;

the point in the current sequence with the largest DiffV(i) is found; and

if the DiffV(i) of that point is less than a predetermined threshold, the recursive segmentation stops; otherwise that point is taken as the split point of the current sequence, the current sequence is divided into two subsequences, and recursive segmentation continues on each of the two subsequences.

Remark 5. The data processing method according to Remark 2, wherein the statistics are compiled separately for working days and weekends.

Remark 6. The data processing method according to Remark 1, wherein the topic determination step further comprises:

a candidate keyword list determination sub-step for calculating, for the determined active time interval, the weight of each extracted keyword, and adding the keywords whose weight exceeds a predetermined threshold to the candidate keyword list of the active time interval;

a keyword relevance calculation sub-step for calculating the relevance between any two keywords in the determined candidate keyword list;

a graph construction sub-step for constructing a graph in which each keyword in the candidate keyword list is a node and each relevance that exceeds a predetermined threshold is an edge between keywords; and

a topic determination sub-step for determining, based on the constructed graph and using a clustering algorithm, the corresponding topics in the determined active time interval.

Remark 7. The data processing method according to Remark 6, wherein, in the candidate keyword list determination sub-step, the weight of each keyword is calculated for the active time interval according to the following formula:

W(k)=count(k)*log(Q/counttimes(k))*log(authorfollowers(k))

where count(k) denotes the number of occurrences of keyword k, Q denotes the number of microblogs in the active time interval, counttimes(k) denotes the number of microblogs containing keyword k, and authorfollowers(k) denotes the total number of followers of the users who posted microblogs containing keyword k.
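The weight formula above can be sketched as follows. The base of the logarithm is not stated, so the natural logarithm is assumed; the function names, sample statistics, and threshold are hypothetical.

```python
import math


def keyword_weight(count, q, counttimes, authorfollowers):
    """W(k) = count(k) * log(Q / counttimes(k)) * log(authorfollowers(k)), natural log assumed."""
    return count * math.log(q / counttimes) * math.log(authorfollowers)


def candidate_keywords(stats, q, threshold):
    """stats maps keyword -> (count, counttimes, authorfollowers); keep weights above threshold."""
    return sorted(
        k for k, (count, counttimes, followers) in stats.items()
        if keyword_weight(count, q, counttimes, followers) > threshold
    )


# Hypothetical statistics for one active time interval containing Q = 1000 microblogs.
stats = {
    "final": (50, 40, 20000),   # frequent, concentrated, widely followed authors
    "the": (900, 1000, 50000),  # appears in every microblog, so log(Q/counttimes) = 0
    "rain": (3, 3, 10),         # rare and posted by little-followed authors
}
candidates = candidate_keywords(stats, q=1000, threshold=100)
```

The middle factor behaves like an inverse document frequency: a keyword that appears in every microblog of the interval receives a weight of zero.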

Remark 8. The method according to Remark 6, wherein, in the keyword relevance calculation sub-step, the relevance between two keywords is calculated by the following formula:

I(A,B)=log(P(A,B))/(log(P(A))*log(P(B)))

where P(A) and P(B) denote, respectively, the probability that a microblog in the active time interval contains keyword A or keyword B, relative to the total number of microblogs, and P(A,B) denotes the probability that a microblog in the active time interval contains both keywords A and B, relative to the total number of microblogs.
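The relevance formula above can be sketched as follows, with each microblog represented as a set of keywords. The natural logarithm is assumed, and returning None when the formula is undefined (log of zero, or a zero denominator) is a choice made for this sketch, not something the text specifies. The function name and sample data are hypothetical.

```python
import math


def relevance(microblogs, a, b):
    """I(A,B) = log(P(A,B)) / (log(P(A)) * log(P(B))); None where the formula is undefined."""
    total = len(microblogs)
    p_a = sum(a in m for m in microblogs) / total
    p_b = sum(b in m for m in microblogs) / total
    p_ab = sum(a in m and b in m for m in microblogs) / total
    if p_ab == 0 or p_a == 1 or p_b == 1:
        return None  # log(0) or division by zero; this handling is an assumption
    return math.log(p_ab) / (math.log(p_a) * math.log(p_b))


# Hypothetical microblogs in one active time interval, reduced to keyword sets.
microblogs = [{"goal", "match"}, {"goal", "match"}, {"goal"}, {"movie"}]
score = relevance(microblogs, "goal", "match")
never_together = relevance(microblogs, "goal", "movie")
```

Because all three probabilities are below 1, the numerator is negative and the denominator positive, so defined scores are negative under this formula.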

Remark 9. The data processing method according to Remark 6, wherein the clustering algorithm includes the CNM graph partitioning algorithm.

Remark 10. A data processing device, comprising:

an active time interval determination unit configured to determine microblog user groups with similar activity habits, and to determine the active time interval of each microblog user group based on the microblogs posted by the focus users in the determined microblog user group;

a keyword extraction unit configured to extract keywords from all microblogs in the determined active time interval; and

a topic determination unit configured to determine, based on the extracted keywords, the corresponding topics in the determined active time interval.

Remark 11. The data processing device according to Remark 10, wherein the active time interval determination unit further comprises:

a user vector building subunit configured to build, according to the times and frequencies at which each microblog user has posted microblogs in the past, a user vector with a predetermined number of dimensions;

an edge determination subunit configured to determine the edges between user nodes based on the similarity between the user vectors;

a microblog user group building subunit configured to build, based on the determined edges, microblog user groups with similar activity habits; and

a focus user determination subunit configured to determine the authority of each microblog user based on one or more of the user's number of followers, the number of microblogs posted, the number of replies to the microblogs posted by the user, and the number of times the microblogs posted by the user have been forwarded, and to select, based on the authority, a predetermined number of microblog users from the microblog user group as the focus users.

Remark 12. The data processing device according to Remark 11, wherein the edge determination subunit is configured to determine the similarity between user vectors using the cosine of the angle between the vectors, and to formally establish an edge between two user nodes whose similarity exceeds a predetermined threshold.

Remark 13. The data processing device according to Remark 10, wherein the active time interval determination unit further comprises:

a microblog quantity statistics subunit configured to count the number of microblogs posted by the focus users in each period of a predetermined cycle, thereby obtaining a time-dependent microblog quantity sequence;

a sequence recursive segmentation subunit configured to recursively segment the counted microblog quantity sequence, thereby obtaining one or more split points; and

an active time interval selection subunit configured to select, from the time intervals determined by the obtained split points, the N time intervals with the largest standard variance as the active time intervals, where N is greater than or equal to 1,

wherein the sequence recursive segmentation subunit is further configured to:

for each point in the current sequence, calculate the following:

AnthorV(i)=|L1(i)|*Var(L1(i))/|L|+|L2(i)|*Var(L2(i))/|L|

DiffV(i)=Var(L(i))-AnthorV(i)

where |L1(i)| and |L2(i)| denote the lengths of the two subsequences obtained by splitting the current sequence at point i, |L| denotes the length of the current sequence, and Var() denotes the standard variance of the current sequence or of a subsequence;

find the point in the current sequence with the largest DiffV(i); and

if the DiffV(i) of that point is less than a predetermined threshold, stop the recursive segmentation; otherwise take that point as the split point of the current sequence, divide the current sequence into two subsequences, and continue recursive segmentation on each of the two subsequences.

Remark 14. The data processing device according to Remark 11, wherein the statistics are compiled separately for working days and weekends.

Remark 15. The data processing device according to Remark 10, wherein the topic determination unit further comprises:

a candidate keyword list determination subunit configured to calculate, for the determined active time interval, the weight of each extracted keyword, and to add the keywords whose weight exceeds a predetermined threshold to the candidate keyword list of the active time interval;

a keyword relevance calculation subunit configured to calculate the relevance between any two keywords in the determined candidate keyword list;

a graph construction subunit configured to construct a graph in which each keyword in the candidate keyword list is a node and each relevance that exceeds a predetermined threshold is an edge between keywords; and

a topic determination subunit configured to determine, based on the constructed graph and using a clustering algorithm, the corresponding topics in the determined active time interval.

Remark 16. The data processing device according to Remark 15, wherein the candidate keyword list determination subunit is further configured to calculate, for the active time interval, the weight of each keyword according to the following formula:

W(k)=count(k)*log(Q/counttimes(k))*log(authorfollowers(k))

Remark 17. The data processing device according to Remark 15, wherein the keyword relevance calculation subunit is further configured to calculate the relevance between two keywords by the following formula:

I(A,B)=log(P(A,B))/(log(P(A))*log(P(B)))

Remark 18. The data processing device according to Remark 15, wherein the clustering algorithm includes the CNM graph partitioning algorithm.

Remark 19. A terminal device, comprising the data processing device according to any one of Remarks 10 to 18.

Remark 20. The terminal device according to Remark 19, wherein the terminal device includes a mobile phone, a palmtop computer, a tablet computer, and a personal computer.

Claims

1. A data processing method, comprising:

an active time interval determination step for determining microblog user groups with similar activity habits, and determining the active time interval of each microblog user group based on the microblogs posted by the focus users in the determined microblog user group;

a keyword extraction step for extracting, for the microblog user group, keywords from all microblogs in the determined active time interval; and

a topic determination step for determining, based on the extracted keywords, the corresponding topics of the microblog user group in the determined active time interval,

wherein, in the active time interval determination step, determining the microblog user groups with similar activity habits further comprises:

a user vector building sub-step for building, according to the times and frequencies at which each microblog user has posted microblogs in the past, a user vector with a predetermined number of dimensions;

a microblog user group building sub-step for building, based on the determined edges, microblog user groups with similar activity habits; and

a focus user determination sub-step for determining the authority of each microblog user based on one or more of the user's number of followers, the number of microblogs posted, the number of replies to the microblogs posted by the user, and the number of times the microblogs posted by the user have been forwarded, and selecting, based on the authority, a predetermined number of microblog users from the microblog user group as the focus users.

2. The data processing method according to claim 1, wherein, in the active time interval determination step, determining the active time interval of each microblog user group based on the microblogs posted by the focus users in the determined microblog user group further comprises:

a microblog quantity statistics sub-step for counting the number of microblogs posted by the focus users in each period of a predetermined cycle, thereby obtaining a time-dependent microblog quantity sequence;

a sequence recursive segmentation sub-step for recursively segmenting the counted microblog quantity sequence, thereby obtaining one or more split points; and

an active time interval selection sub-step for selecting, from the time intervals determined by the obtained split points, the N time intervals with the largest standard variance as the active time intervals, where N is greater than or equal to 1,

wherein, in the sequence recursive segmentation sub-step:

for each point in the current sequence, the following are calculated:

AnthorV(i)=|L1(i)|*Var(L1(i))/|L|+|L2(i)|*Var(L2(i))/|L|

DiffV(i)=Var(L(i))-AnthorV(i)

where |L1(i)| and |L2(i)| denote the lengths of the two subsequences obtained by splitting the current sequence at point i, |L| denotes the length of the current sequence, and Var() denotes the standard variance of the current sequence or of a subsequence;

the point in the current sequence with the largest DiffV(i) is found; and

if the DiffV(i) of that point is less than a predetermined threshold, the recursive segmentation stops; otherwise that point is taken as the split point of the current sequence, the current sequence is divided into two subsequences, and recursive segmentation continues on each of the two subsequences.

3. The data processing method according to claim 1, wherein the topic determination step further comprises:

a candidate keyword list determination sub-step for calculating, for the determined active time interval, the weight of each extracted keyword, and adding the keywords whose weight exceeds a predetermined threshold to the candidate keyword list of the active time interval;

a keyword relevance calculation sub-step for calculating the relevance between any two keywords in the determined candidate keyword list;

a graph construction sub-step for constructing a graph in which each keyword in the candidate keyword list is a node and each calculated relevance that exceeds a predetermined threshold is an edge between keywords; and

a topic determination sub-step for determining, based on the constructed graph and using a clustering algorithm, the corresponding topics in the determined active time interval.

4. A data processing device, comprising:

an active time interval determination unit configured to determine microblog user groups with similar activity habits, and to determine the active time interval of each microblog user group based on the microblogs posted by the focus users in the determined microblog user group;

a keyword extraction unit configured to extract, for the microblog user group, keywords from all microblogs in the determined active time interval; and

a topic determination unit configured to determine, based on the extracted keywords, the corresponding topics of the microblog user group in the determined active time interval,

wherein the active time interval determination unit further comprises:

a user vector building subunit configured to build, according to the times and frequencies at which each microblog user has posted microblogs in the past, a user vector with a predetermined number of dimensions;

a microblog user group building subunit configured to build, based on the determined edges, microblog user groups with similar activity habits; and

a focus user determination subunit configured to determine the authority of each microblog user based on one or more of the user's number of followers, the number of microblogs posted, the number of replies to the microblogs posted by the user, and the number of times the microblogs posted by the user have been forwarded, and to select, based on the authority, a predetermined number of microblog users from the microblog user group as the focus users.

5. The data processing device according to claim 4, wherein the active time interval determination unit further comprises:

a microblog quantity statistics subunit configured to count the number of microblogs posted by the focus users in each period of a predetermined cycle, thereby obtaining a time-dependent microblog quantity sequence;

a sequence recursive segmentation subunit configured to recursively segment the counted microblog quantity sequence, thereby obtaining one or more split points; and

an active time interval selection subunit configured to select, from the time intervals determined by the obtained split points, the N time intervals with the largest standard variance as the active time intervals, where N is greater than or equal to 1,

wherein the sequence recursive segmentation subunit is further configured to:

for each point in the current sequence, calculate the following:

AnthorV(i)=|L1(i)|*Var(L1(i))/|L|+|L2(i)|*Var(L2(i))/|L|

DiffV(i)=Var(L(i))-AnthorV(i)

where |L1(i)| and |L2(i)| denote the lengths of the two subsequences obtained by splitting the current sequence at point i, |L| denotes the length of the current sequence, and Var() denotes the standard variance of the current sequence or of a subsequence;

find the point in the current sequence with the largest DiffV(i); and

if the DiffV(i) of that point is less than a predetermined threshold, stop the recursive segmentation; otherwise take that point as the split point of the current sequence, divide the current sequence into two subsequences, and continue recursive segmentation on each of the two subsequences.

6. The data processing device according to claim 4, wherein the topic determination unit further comprises:

a candidate keyword list determination subunit configured to calculate, for the determined active time interval, the weight of each extracted keyword, and to add the keywords whose weight exceeds a predetermined threshold to the candidate keyword list of the active time interval;

a keyword relevance calculation subunit configured to calculate the relevance between any two keywords in the determined candidate keyword list;

a graph construction subunit configured to construct a graph in which each keyword in the candidate keyword list is a node and each calculated relevance that exceeds a predetermined threshold is an edge between keywords; and

a topic determination subunit configured to determine, based on the constructed graph and using a clustering algorithm, the corresponding topics in the determined active time interval.

7. The data processing device according to claim 6, wherein the candidate keyword list determination subunit is further configured to calculate, for the active time interval, the weight of each keyword according to the following formula:

W(k)=count(k)*log(Q/counttimes(k))*log(authorfollowers(k))

8. A terminal device, comprising the data processing device according to any one of claims 4 to 7.