CN103631809A - Data clustering device and method

Info

Publication number: CN103631809A
Application number: CN201210305587.8A
Authority: CN
Inventors: 庄惟尧
Original assignee: Acer Inc
Current assignee: Acer Inc
Priority date: 2012-08-24
Filing date: 2012-08-24
Publication date: 2014-03-12

Abstract

The invention provides a data clustering device and method. The data clustering device comprises a news database for storing a plurality of data, a calculation module, a clustering module and a comparison module. The calculation module establishes a global silhouette pattern table according to the distance relationships among the data and obtains a preliminary reference cluster number from that table. The clustering module divides the data into a plurality of clusters with a clustering algorithm according to the preliminary reference cluster number, and then calculates the intra-cluster average distance of each cluster. The comparison module compares each intra-cluster average distance with a threshold value; if the intra-cluster average distance is smaller than the threshold value, the corresponding cluster is stored in an event database. The device and method cluster disordered news so as to obtain clusters of similar news events that come from different news RSS (Really Simple Syndication) sources, thereby improving the accuracy of news event clustering.

Description

Data clustering device and method

Technical field

The present invention relates generally to data clustering technology, and more particularly to a technique that clusters data using a document auto-detecting recursive clustering method (Auto-detect Text Recursively Clustering, ADTR).

Background technology

In recent years, owing to the rapid development of wireless communication technology, a wide variety of portable and hand-held devices, such as mobile phones, smart phones, personal digital assistants (PDAs) and tablet PCs, have been continually brought to market, and the functions of these electronic products have become increasingly diverse. Because of their convenience, these devices have also become daily necessities.

Besides the hardware of these wireless communication devices, software and functions that work with the hardware are also constantly being developed, so that users can manage finances, work, be entertained or obtain information more conveniently and more immediately, anytime and anywhere. With the spread of hand-held mobile devices and mobile networks, reading news over a mobile network while on the move, for example while riding a bus or rapid transit, has become an important trend. There are now numerous news source websites, and the news that each media outlet provides through Really Simple Syndication (RSS) is sorted in its own way, so the news is very cluttered. Although numerous news events can easily be obtained, they cannot be tracked or ranked by importance. Moreover, current news-reading applications mainly present news by RSS source and by broad news category. As a result, readers cannot easily find the news events they care about, nor the news events that are currently important.

In addition, Chinese news is written in an unstructured form, so automatic classification or clustering by artificial intelligence cannot easily judge similar articles to be one cluster. On the other hand, during clustering, different news items often end up in the same cluster merely because they share some less representative words, which makes it harder to find the same news event. Moreover, deciding the number of clusters is often very difficult when clustering data; it is generally predefined or observed in advance, and either way requires manual assistance.

Summary of the invention

In view of the above problems of the prior art, the invention provides a data clustering technique, and in particular a technique that clusters data using a document auto-detecting recursive clustering method (Auto-detect Text Recursively Clustering, ADTR).

According to an embodiment of the invention, a data clustering method is provided, comprising the following steps: obtaining a plurality of data from a news database; establishing a global silhouette pattern table according to the distance relationships among the data, to obtain a preliminary reference cluster number; dividing the plurality of data into a plurality of clusters with a clustering algorithm according to the preliminary reference cluster number; calculating an intra-cluster average distance of each cluster; and comparing whether the intra-cluster average distance is less than a threshold value, wherein if the intra-cluster average distance is less than the threshold value, the corresponding cluster is stored in an event database.

According to an embodiment of the invention, a data clustering device is provided, comprising: a news database for storing a plurality of data; a calculation module for establishing a global silhouette pattern table according to the distance relationships among the data, and then obtaining a preliminary reference cluster number according to the global silhouette pattern table; a clustering module for dividing the plurality of data into a plurality of clusters with a clustering algorithm according to the preliminary reference cluster number, and then calculating an intra-cluster average distance (Intra-Cluster distance) of each cluster; and a comparison module for comparing whether the intra-cluster average distance is less than a threshold value, wherein if the intra-cluster average distance is less than the threshold value, the corresponding cluster is stored in an event database.

The invention can cluster cluttered news to obtain clusters of similar news events that come from different news RSS (Really Simple Syndication) sources, thereby improving the accuracy of news event clustering.

Brief description of the drawings

Fig. 1 is an architecture diagram of the data clustering device 100 according to an embodiment of the invention.

Fig. 2 is a schematic diagram of global silhouette values versus the corresponding numbers of clusters according to an embodiment of the invention.

Fig. 3 is a flowchart 300 of the data clustering method according to an embodiment of the invention.

Fig. 4 is a flowchart 400 of establishing the global silhouette pattern table according to an embodiment of the invention.

Fig. 5 is a flowchart 500 of calculating the intra-cluster average distance of each cluster according to an embodiment of the invention.

[Description of main reference numerals]

100～data clustering device;

110～news database;

120～preprocessing module;

130～calculation module;

140～clustering module;

150～comparison module;

160～event database;

300, 400, 500～flowcharts;

S310, S320, S330, S340, S350, S360, S370, S380, S410, S420, S430, S510, S520～steps.

Detailed description of the embodiments

Fig. 1 is an architecture diagram of the data clustering device 100 according to an embodiment of the invention. As shown in the figure, the data clustering device 100 comprises a news database 110, a preprocessing module 120, a calculation module 130, a clustering module 140, a comparison module 150 and an event database 160.

According to an embodiment of the invention, the news database 110 stores and provides a plurality of data, and the data stored in the news database 110 can be updated in real time. The data described here can include all types of news events, such as world news, political news, social news, sports news and entertainment news, and can also include various feature reports or text data.

According to an embodiment of the invention, the preprocessing module 120 applies a preprocessing operation to the plurality of data stored in the news database 110 in advance; that is, it vectorizes a plurality of features of the data so that the data can be converted into a spatial model, which facilitates the subsequent clustering. The features described here are the distinct keywords extracted from the content of the data after word segmentation or sentence segmentation. For example, from the sentence "Global warming has caused the Arctic icebergs to melt, thereby making the sea level rise", keywords such as "global warming", "Arctic", "iceberg" and "sea level rise" can be extracted. After the keywords are extracted, they are vectorized and converted into vector points with different weight values. After such vectorization, the original data are converted from written form into a spatial model represented by vectors.
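The vectorization step can be sketched as follows. This is a minimal illustration only: the text says that keywords become vector points with different weight values but does not name a weighting scheme, so the TF-IDF weighting, the function name `vectorize` and the second sample document are assumptions.

```python
import math
from collections import Counter


def vectorize(keyword_lists):
    """Convert per-document keyword lists into weighted vectors.

    TF-IDF is an assumed weighting chosen for illustration; the source
    only states that keywords receive different weight values.
    """
    n_docs = len(keyword_lists)
    # Fixed vocabulary order so every document maps onto the same axes.
    vocab = sorted({kw for kws in keyword_lists for kw in kws})
    # Document frequency: the number of documents containing each keyword.
    doc_freq = Counter(kw for kws in keyword_lists for kw in set(kws))

    vectors = []
    for kws in keyword_lists:
        term_freq = Counter(kws)
        vectors.append([
            # Smoothed inverse document frequency; absent keywords weigh 0.
            term_freq[kw] * (math.log((1 + n_docs) / (1 + doc_freq[kw])) + 1.0)
            for kw in vocab
        ])
    return vocab, vectors


# Keywords from the example sentence, plus a hypothetical second document.
docs = [
    ["global warming", "Arctic", "iceberg", "sea level rise"],
    ["Arctic", "iceberg", "expedition"],
]
vocab, vectors = vectorize(docs)
```

Keywords shared by every document (here "Arctic" and "iceberg") receive the lowest non-zero weight, while keywords unique to one document weigh more, which is one way to obtain the differing weight values the paragraph mentions.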

According to an embodiment of the invention, the calculation module 130 receives the data preprocessed by the preprocessing module 120, establishes a global silhouette pattern table (Global Silhouette Pattern) according to the distance relationships among the data in the spatial model, and then obtains a preliminary reference cluster number from the established table. More specifically, in this embodiment, the calculation module 130 establishes the global silhouette pattern table and obtains the preliminary reference cluster number as follows. First, using the silhouette formula given below, it calculates a plurality of silhouette coefficients from the distance relationships among the data in the clusters. A silhouette coefficient is an index for evaluating the validity and state of a clustering, and indicates how good the clustering is. Then, for the clustering results obtained with different numbers of clusters, it produces a plurality of global silhouette values (Global Silhouette value, GS_u), one for each number of clusters in a cluster number range, where the cluster number range is from 2 to the total number of data. Finally, the calculation module 130 establishes the global silhouette pattern table from these global silhouette values, recording the global silhouette value (GS_u) for each number of clusters, and sets the number of clusters corresponding to the maximum global silhouette value as the preliminary reference cluster number. The detailed calculation is described below.

Silhouette formula:

To compute the silhouette coefficient of the i-th data point:

1. Calculate the average distance (a_i) from the i-th data point to every other data point in the same cluster.

2. For the i-th data point and every other cluster, calculate the average distance from this data point to all data in that cluster, and take the minimum of these values (b_i).

3. Calculate the silhouette coefficient (S_i) of the i-th data point, defined as:

S_{i} = \frac{b_{i} - a_{i}}{\max\{a_{i}, b_{i}\}}

where the max operator in the denominator takes the larger of a_i and b_i, so that the formula satisfies -1 ≤ S_i ≤ 1.
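The three steps above can be sketched in code as follows. This is an illustrative sketch rather than the patent's implementation: the Euclidean distance, the helper names, and the convention of returning 0 for a point that is alone in its cluster are assumptions, and the function expects at least two clusters, matching the stated cluster number range.

```python
import math


def euclidean(p, q):
    """Euclidean distance; an assumed stand-in for the distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def mean_distance(point, others, dist):
    """Average distance from one point to a non-empty list of points."""
    return sum(dist(point, o) for o in others) / len(others)


def silhouette_coefficient(i, points, labels, dist=euclidean):
    """Return S_i for points[i]; labels[k] is the cluster of points[k]."""
    own = labels[i]
    other_labels = set(labels) - {own}
    if not other_labels:
        raise ValueError("at least two clusters are required")

    # Step 1: a_i, the mean distance to the other members of the same cluster.
    same = [p for k, p in enumerate(points) if labels[k] == own and k != i]
    if not same:
        return 0.0  # assumed convention for a point alone in its cluster
    a_i = mean_distance(points[i], same, dist)

    # Step 2: b_i, the smallest mean distance to the members of another cluster.
    b_i = min(
        mean_distance(
            points[i],
            [p for k, p in enumerate(points) if labels[k] == lab],
            dist,
        )
        for lab in other_labels
    )

    # Step 3: S_i = (b_i - a_i) / max{a_i, b_i}.
    denom = max(a_i, b_i)
    return 0.0 if denom == 0 else (b_i - a_i) / denom


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_placed = silhouette_coefficient(0, points, [0, 0, 1, 1])
misplaced = silhouette_coefficient(0, points, [1, 0, 1, 1])
```

In this toy data the first point scores close to 1 when grouped with its near neighbour and becomes negative when it is assigned to the distant cluster instead.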

To obtain the global silhouette value (GS_u), the calculation module 130 first calculates the cluster silhouette value (Cluster Silhouette Value) of each cluster for each number of clusters. For a given cluster under a given number of clusters, the cluster silhouette value (S_j) is calculated as follows:

S_{j} = \frac{1}{m} \sum_{i = 1}^{m} S_{i},

where m is the number of data contained in that single cluster.

Taking the case where the data are divided into c clusters as an example, i.e. the number of clusters is c, the global silhouette value (GS_u) is obtained by averaging the cluster silhouette values of all clusters. The global silhouette value (GS_u) is defined as follows:

{GS}_{u} = \frac{1}{c} \sum_{j = 1}^{c} S_{j}
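As a worked illustration with made-up numbers (not taken from the patent), suppose c = 2, where the first cluster holds two data points with silhouette coefficients 0.8 and 0.6, and the second holds three data points with coefficients 0.4, 0.2 and 0.3:

```latex
S_{1} = \frac{1}{2}(0.8 + 0.6) = 0.7, \qquad
S_{2} = \frac{1}{3}(0.4 + 0.2 + 0.3) = 0.3, \qquad
{GS}_{u} = \frac{1}{2}(S_{1} + S_{2}) = 0.5
```

Because each cluster is weighted equally regardless of its size, this value differs from the plain mean over all five data points, which would be 2.3 / 5 = 0.46.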

Fig. 2 is a schematic diagram of global silhouette values versus the corresponding numbers of clusters according to an embodiment of the invention. As shown in Fig. 2, if the news database 110 holds m data, the cluster number range to be evaluated runs from 2 clusters to m clusters. The calculation module 130 calculates the global silhouette value for each division of the data into 2 to m clusters and records each calculated value in the global silhouette pattern table. If the maximum global silhouette value is obtained when the data are divided into N clusters, the calculation module 130 sets N as the preliminary reference cluster number.
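The table-building and selection logic described here might look like the sketch below. The function names and the `cluster_fn` and `global_silhouette_fn` parameters are hypothetical placeholders for whichever clustering algorithm and GS_u computation are used; the stand-in functions and scores in the usage lines are invented purely to exercise the loop.

```python
def build_global_silhouette_table(data, cluster_fn, global_silhouette_fn):
    """Map each cluster number k in 2..len(data) to its global silhouette value.

    cluster_fn(data, k) is assumed to return one cluster label per item, and
    global_silhouette_fn(data, labels) is assumed to return GS_u for them.
    """
    table = {}
    for k in range(2, len(data) + 1):
        labels = cluster_fn(data, k)
        table[k] = global_silhouette_fn(data, labels)
    return table


def preliminary_cluster_number(table):
    """Return the cluster number N whose global silhouette value is largest."""
    return max(table, key=table.get)


# Stand-ins with invented scores, used only to show the control flow.
fake_scores = {2: 0.3, 3: 0.7, 4: 0.1}
table = build_global_silhouette_table(
    data=["a", "b", "c", "d"],
    cluster_fn=lambda data, k: [i % k for i in range(len(data))],
    global_silhouette_fn=lambda data, labels: fake_scores[len(set(labels))],
)
n_clusters = preliminary_cluster_number(table)
```

With these invented scores the table peaks at three clusters. When several cluster numbers tie for the maximum, this sketch keeps the smallest one; the source does not say how ties are resolved.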

According to an embodiment of the invention, the clustering module 140 divides the plurality of data into a plurality of clusters with a clustering algorithm according to the preliminary reference cluster number, and then calculates the intra-cluster average distance (Intra-Cluster Distance) of each cluster. To do so, the clustering module 140 first calculates, in the vector space, the center point of the data contained in each cluster, and then calculates the average distance from the data contained in each cluster to that center point; each average distance so calculated is the intra-cluster average distance of the corresponding cluster. In this embodiment, the intra-cluster average distance is obtained with the cosine distance (Cosine Distance) formula, and it can be used to evaluate the cohesion of a cluster.
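A minimal sketch of this calculation is given below, assuming the center point is the component-wise mean of the cluster's vectors and that cosine distance is defined as one minus cosine similarity. The function names and the handling of zero vectors are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors get distance 1.0 (assumed)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def intra_cluster_distance(vectors):
    """Average cosine distance from each member to the cluster center."""
    center = centroid(vectors)
    return sum(cosine_distance(v, center) for v in vectors) / len(vectors)


tight = intra_cluster_distance([[1.0, 0.0], [1.0, 0.0]])
loose = intra_cluster_distance([[1.0, 0.0], [0.0, 1.0]])
```

Identical vectors give a distance of essentially zero, while two orthogonal vectors give about 0.29; a smaller value indicates a more cohesive cluster.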

It should be noted that the clustering algorithm used in the above embodiment is a hierarchical clustering algorithm, but the invention is not limited to this algorithm. After consulting this specification, any person skilled in the art can replace the hierarchical clustering algorithm with another suitable clustering algorithm, for example the K-means algorithm or the K-medoids algorithm of partitional clustering.

According to an embodiment of the invention, the comparison module 150 compares the intra-cluster average distance of each cluster with a threshold value (threshold). If the intra-cluster average distance is less than the threshold value, the corresponding cluster is stored in an event database 160. If the intra-cluster average distance is not less than the threshold value, a recursive clustering operation is performed: the data contained in that cluster are passed back to the calculation module 130, which again calculates a global silhouette pattern table to obtain a preliminary reference cluster number, and the remaining modules of the data clustering device 100 then repeat their processing. When all data have been stored in the event database 160, all data have been clustered. As for the threshold value, any person skilled in the art can, after consulting this specification, set a suitable value (for example, 0.2 to 0.3) as the threshold. According to an embodiment of the invention, a user can obtain the clustering results produced by the data clustering device 100 from the event database 160 through a display unit (not shown) and a search unit (not shown), and the results are presented on the display unit.

Fig. 3 is a flowchart 300 of the data clustering method according to an embodiment of the invention. First, at step S310, a plurality of data are obtained from a news database. At step S320, a preprocessing operation is performed to vectorize a plurality of features of the data and convert the data into a spatial model. At step S330, a global silhouette pattern table is established according to the distance relationships among the data, to obtain a preliminary reference cluster number. At step S340, the data are divided into a plurality of clusters with a clustering algorithm according to the preliminary reference cluster number. At step S350, the intra-cluster average distance of each cluster is obtained. At step S360, each intra-cluster average distance is compared with a threshold value. If the intra-cluster average distance is less than the threshold value, step S370 stores the corresponding cluster in an event database. If it is not less than the threshold value, step S380 recalculates the silhouette coefficients for the corresponding cluster to obtain a new preliminary reference cluster number; that is, the method returns to step S330 and continues clustering. It should again be noted that the clustering algorithm used in this embodiment is a hierarchical clustering algorithm, but the invention is not limited to this algorithm; after consulting this specification, any person skilled in the art can replace it with another suitable clustering algorithm, for example the K-means algorithm or the K-medoids algorithm of partitional clustering.
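The recursive loop of steps S330 to S380 can be sketched as follows. This is a hedged outline, not the patented implementation: `choose_k`, `cluster_fn` and `intra_distance` are hypothetical callables standing in for the silhouette-based selection, the clustering algorithm and the intra-cluster average distance, and the handling of groups that cannot be split further is an added safeguard. The toy one-dimensional functions exist only to make the example runnable.

```python
def recursive_clustering(data, choose_k, cluster_fn, intra_distance, threshold=0.25):
    """Cluster repeatedly until every cluster is below the threshold."""
    event_db = []            # stands in for the event database
    pending = [list(data)]   # groups that still need to be clustered
    while pending:
        group = pending.pop()
        if not group:
            continue
        if len(group) < 2:
            event_db.append(group)   # assumption: a single item is stored as is
            continue
        k = choose_k(group)                      # S330: preliminary cluster number
        labels = cluster_fn(group, k)            # S340: divide into clusters
        clusters = {}
        for item, label in zip(group, labels):
            clusters.setdefault(label, []).append(item)
        for members in clusters.values():
            # S350/S360: compare the intra-cluster distance with the threshold.
            # Added safeguard: a group that did not split is stored, not retried.
            if len(clusters) == 1 or intra_distance(members) < threshold:
                event_db.append(members)         # S370: store the cluster
            else:
                pending.append(members)          # S380: cluster it again
    return event_db


# Toy one-dimensional stand-ins (hypothetical, for demonstration only).
def split_at_largest_gap(group, k):
    """Ignore k and split the sorted values in two at their widest gap."""
    ordered = sorted(group)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = ordered[gaps.index(max(gaps))]
    return [0 if x <= cut else 1 for x in group]


def spread(members):
    """Range of the values, used here in place of an average distance."""
    return max(members) - min(members)


events = recursive_clustering(
    [1.0, 1.1, 5.0, 5.2, 9.0],
    choose_k=lambda group: 2,
    cluster_fn=split_at_largest_gap,
    intra_distance=spread,
)
```

In this toy run the first pass stores the tight pair near 1 and sends the remaining three values back for another pass, which then separates the pair near 5 from the lone value 9. The default threshold of 0.25 lies within the 0.2 to 0.3 range that the description gives as an example.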

Fig. 4 is a flowchart 400 of establishing the global silhouette pattern table according to an embodiment of the invention. First, at step S410, a silhouette formula is applied to the distance relationships among the data in the vector space to produce a plurality of global silhouette values, one for each number of clusters in a cluster number range, where the range is from 2 to the total number of data. At step S420, the global silhouette values are recorded in the global silhouette pattern table. At step S430, the number of clusters corresponding to the maximum global silhouette value is set as the preliminary reference cluster number.

Fig. 5 is a flowchart 500 of obtaining the intra-cluster average distance of each cluster according to an embodiment of the invention. First, at step S510, the center point of the data contained in each cluster is obtained. At step S520, the average distance from the data contained in each cluster to that center point is obtained and taken as the intra-cluster average distance.

In view of users' needs and the existing problems of RSS information sources, and in order to give users a better reading experience, the data clustering method proposed here is based on the fields of artificial intelligence (Artificial Intelligence) and text mining (Text Mining). It uses the document auto-detecting recursive clustering technique (ADTR) to improve on conventional clustering (Clustering) algorithms by automatically detecting the clustering parameter. Cluttered news can thereby be clustered to obtain clusters of similar news events from different news RSS sources, improving the accuracy of news event clustering. The proposed method can also help find important names and potentially important terms in the news, can adapt as the news situation changes, and shows good adaptive robustness against errors in the lexicon. Furthermore, conventional single-pass clustering (Single-pass Clustering) processes one article at a time and compares it for similarity against the existing clusters to decide its cluster. In contrast, the proposed method operates on all existing data at once: it establishes the global silhouette pattern table to find the initial number of clusters used by the ADTR technique, and then clusters recursively.

A particular feature, structure or characteristic referred to as "an embodiment" or "one embodiment" in this specification may be included in at least one embodiment of this specification. Therefore, the phrase "in one embodiment" appearing in different places does not necessarily refer to the same embodiment. In addition, such particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. Moreover, it should be noted that the drawings are provided only to aid explanation and are not drawn to scale.

Although this specification describes the subject matter of the invention by way of the disclosed embodiments, these embodiments are not intended to limit the scope of the invention, which is defined by the claims. Any person skilled in the art will readily appreciate the above advantages from the disclosed embodiments. After reading this description, any person skilled in the art can make suitable changes and substitutions in a broad manner without departing from the spirit and scope of the invention.

Claims

1. A data clustering device, comprising:

a news database for storing a plurality of data;

a calculation module for establishing a global silhouette pattern table according to the distance relationships among the data, and obtaining a preliminary reference cluster number according to the global silhouette pattern table;

a clustering module for dividing the plurality of data into a plurality of clusters with a clustering algorithm according to the preliminary reference cluster number, and then calculating an intra-cluster average distance of each of the clusters; and

a comparison module for comparing whether the intra-cluster average distance is less than a threshold value, wherein if the intra-cluster average distance is less than the threshold value, the cluster corresponding to the intra-cluster average distance is stored in an event database.

2. The data clustering device as claimed in claim 1, further comprising a preprocessing module for applying a preprocessing operation to the plurality of data, so as to vectorize a plurality of features of the data and convert the data into a spatial model.

3. The data clustering device as claimed in claim 1, wherein if the intra-cluster average distance is not less than the threshold value, the cluster corresponding to the intra-cluster average distance is passed back to the calculation module, so as to establish the global silhouette pattern table and obtain the preliminary reference cluster number.

4. The data clustering device as claimed in claim 1, wherein the step in which the calculation module establishes the global silhouette pattern table comprises:

calculating a plurality of silhouette coefficients with a silhouette formula according to the distance relationships among the data, so as to produce a plurality of global silhouette values corresponding to the different numbers of clusters in a cluster number range, wherein the cluster number range is from 2 to the total number of the data;

recording the silhouette values in the global silhouette pattern table; and

setting the number of clusters corresponding to the maximum of the global silhouette values as the preliminary reference cluster number.

5. The data clustering device as claimed in claim 1, wherein the step in which the clustering module calculates the intra-cluster average distance of each of the clusters comprises:

obtaining a center point of the data contained in each of the clusters; and

obtaining an average distance from the data contained in each of the clusters to the center point, the average distance being the intra-cluster average distance.

6. The data clustering device as claimed in claim 1, wherein the clustering algorithm is a hierarchical clustering algorithm.

7. A data clustering method, comprising the following steps:

obtaining a plurality of data from a news database;

establishing a global silhouette pattern table according to the distance relationships among the data, to obtain a preliminary reference cluster number;

dividing the plurality of data into a plurality of clusters with a clustering algorithm according to the preliminary reference cluster number;

obtaining an intra-cluster average distance of each of the clusters; and

comparing whether the intra-cluster average distance is less than a threshold value, wherein if the intra-cluster average distance is less than the threshold value, the cluster corresponding to the intra-cluster average distance is stored in an event database.

8. The data clustering method as claimed in claim 7, further comprising, before establishing the global silhouette pattern table, applying a preprocessing operation to the data so as to vectorize a plurality of features of the data and convert the data into a spatial model.

9. The data clustering method as claimed in claim 7, wherein if the intra-cluster average distance is not less than the threshold value, the global silhouette pattern table is re-established for the cluster corresponding to the intra-cluster average distance, to obtain the preliminary reference cluster number.

10. The data clustering method as claimed in claim 7, wherein the step of establishing the global silhouette pattern table comprises:

recording the silhouette values in the global silhouette pattern table; and