CN107067045A - Data clustering method, device, computer-readable medium and electronic equipment - Google Patents

Publication number
CN107067045A
Authority
CN
China
Prior art keywords
data
distance
classification
cluster centre
existing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710400066.3A
Other languages
Chinese (zh)
Inventor
李树海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201710400066.3A
Publication of CN107067045A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The invention provides a data clustering method, a device, a computer-readable medium and electronic equipment. The data clustering method includes: obtaining a data set to be clustered; calculating the distance between each data item in the data set and the cluster centres of the existing categories; if the distance between any data item in the data set and the cluster centre of any existing category is less than or equal to a distance threshold, assigning that data item to that category; and if the distances between any data item in the data set and the cluster centres of all existing categories are all greater than the distance threshold, creating a new category and assigning that data item to the new category. When clustering data, the technical scheme of the invention does not require the number of clusters or the cluster centres to be specified in advance, which avoids the adverse effect of poorly chosen initial cluster centres on the final clustering result, and it can also reduce the time spent on the clustering process.

Description

Data clustering method, device, computer-readable medium and electronic equipment
Technical field
The present invention relates to the technical field of data processing, and in particular to a data clustering method, a device, a computer-readable medium and electronic equipment.
Background
When building a user-portrait label model, after user features have been extracted and the feature data standardised, many scenarios build labels by clustering, such as promotion-sensitivity clustering, review-sensitivity clustering and customer-loyalty clustering. Clustering means dividing users, for a given set of user features and according to some specific criterion, into different classes or clusters, so that the user features within the same class or cluster are as similar as possible (or as close together as possible), while the user features in different classes or clusters differ as much as possible. In short, after clustering, data of the same class are gathered together as far as possible and data of different classes are separated as far as possible. The function these scenarios need is precisely the clustering of users.
At present, the k-means clustering algorithm is generally used for data clustering. The k-means algorithm accepts a parameter k and then divides the n input data objects into k clusters such that objects in the same cluster have high similarity and objects in different clusters have low similarity.
However, the k-means algorithm has the following defects:
(1) The number k of cluster centres must be given in advance, but in practice a suitable value of k is very hard to estimate; in many cases it is not known beforehand how many categories a given data set should be divided into.
(2) The k-means algorithm requires the initial cluster centres to be chosen manually, and different initial cluster centres may lead to entirely different clustering results; if the initial values are chosen poorly, an effective clustering result may not be obtained.
(3) The k-means algorithm is sensitive to outliers and cannot detect them, and outliers sometimes have a large effect on the accuracy of the cluster centres.
(4) The k-means algorithm must repeatedly adjust the assignment of samples and recalculate the cluster centres after each adjustment; it converges slowly and its time complexity is high, so its time overhead is very large when the data volume is large.
(5) The k-means algorithm must scan the full data set multiple times and therefore cannot cluster real-time data.
Therefore, a new data clustering scheme is needed to overcome, at least to some extent, one or more of the above problems.
It should be noted that the information disclosed in the above background section is only intended to aid understanding of the background of the invention, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the invention
An object of the invention is to provide a data clustering method, a device, a computer-readable medium and electronic equipment that overcome, at least to some extent, one or more problems caused by the limitations and defects of the related art.
Other features and advantages of the invention will become apparent from the following detailed description, or may be learned in part by practice of the invention.
According to a first aspect of the invention, a data clustering method is provided, including: obtaining a data set to be clustered; calculating the distance between each data item in the data set and the cluster centres of the existing categories; if the distance between any data item in the data set and the cluster centre of any existing category is less than or equal to a distance threshold, assigning that data item to that category; and if the distances between any data item in the data set and the cluster centres of all existing categories are all greater than the distance threshold, creating a new category and assigning that data item to the new category.
In some embodiments of the invention, based on the foregoing scheme, the method further includes: before calculating the distance between any data item in the data set and the cluster centres of the existing categories, if no category exists yet, creating a new category and assigning that data item to the new category.
In some embodiments of the invention, based on the foregoing scheme, the method further includes: after the data item has been assigned to the existing category or to the new category, updating the cluster centre of that category.
In some embodiments of the invention, based on the foregoing scheme, the step of calculating the distance between each data item in the data set and the cluster centres of the existing categories includes: for the data items in the data set, calculating in turn the distance between each data item and the cluster centres of the existing categories.
In some embodiments of the invention, based on the foregoing scheme, the step of calculating in turn the distance between each data item and the cluster centres of the existing categories includes: calculating the distance between a first data item in the data set and the cluster centres of the existing categories; assigning the first data item to a category according to those distances and updating the cluster centre of that category; and then calculating the distance between a second data item in the data set and the cluster centres of the existing categories.
In some embodiments of the invention, based on the foregoing scheme, the method further includes: while calculating the distances between any data item in the data set and the cluster centres of the existing categories, if the distance between that data item and the cluster centre of any existing category is determined to be less than or equal to the distance threshold, stopping the calculation of the distances between that data item and the cluster centres of the remaining categories.
In some embodiments of the invention, based on the foregoing scheme, calculating the distance between any data item in the data set and the cluster centres of the existing categories includes: calculating the distances between the data item and the cluster centres of all existing categories to obtain the shortest distance between the data item and those cluster centres; judging whether the shortest distance is less than or equal to the distance threshold; and, if it is, taking the category corresponding to the shortest distance as the category to which the data item is assigned.
In some embodiments of the invention, based on the foregoing scheme, the method further includes: selecting part of the data in the data set as a sample data set; calculating the pairwise distances between the data items in the sample data set to obtain the distance set of the sample data set; determining, from the distance set, the probability density function of the two-dimensional Gaussian mixture distribution that the distance set obeys; and determining the distance threshold from the probability density function.
In some embodiments of the invention, based on the foregoing scheme, the step of determining the distance threshold from the probability density function includes: calculating the minimum point of the probability density function, where the abscissa of any point of the probability density function represents a distance value; and taking the abscissa of the minimum point as the distance threshold.
According to a second aspect of the invention, a data clustering device is provided, including: an acquiring unit for obtaining a data set to be clustered; a computing unit for calculating the distance between each data item in the data set and the cluster centres of the existing categories; and a processing unit for assigning any data item in the data set to an existing category when the distance between that data item and the cluster centre of that category is less than or equal to a distance threshold, and for creating a new category and assigning the data item to the new category when the distances between that data item and the cluster centres of all existing categories are all greater than the distance threshold.
According to a third aspect of the invention, a computer-readable medium is provided, on which a computer program is stored; when executed by a processor, the program implements the data clustering method of the first aspect.
According to a fourth aspect of the invention, electronic equipment is provided, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data clustering method of the first aspect.
In the technical schemes provided by some embodiments of the invention, a new category is created, and the data item is assigned to it, when the distances between a data item in the data set and the cluster centres of all existing categories are all greater than the distance threshold. The actual number of clusters can therefore be adjusted flexibly according to how the data set to be clustered actually clusters. In addition, a new category can be created when no category exists yet, so the number of clusters need not be given in advance, which avoids the effect of an unsuitable preset number of clusters on the clustering result. Likewise, the initial cluster centres need not be specified manually, which avoids the adverse effect of poorly chosen initial cluster centres on the final clustering result.
In the technical schemes provided by some embodiments of the invention, the clustering operation can be completed by scanning the data in the data set to be clustered only once, which effectively reduces the time spent on the clustering process.
In the technical schemes provided by some embodiments of the invention, the distance threshold is determined from the probability density function of a two-dimensional Gaussian mixture distribution, so that the distance threshold can be chosen automatically according to the actual data set to be clustered, which avoids the time cost and the unsuitable choices associated with selecting the distance threshold manually.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the invention.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention. Obviously, the drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 schematically shows a flow chart of a data clustering method according to a first embodiment of the invention;
Fig. 2 shows a schematic diagram of calculating the pairwise distances between the elements of a data set according to an embodiment of the invention;
Fig. 3 shows a schematic curve of the probability density function of the two-dimensional Gaussian mixture distribution obeyed by the distance set according to an embodiment of the invention;
Fig. 4 schematically shows a flow chart of a data clustering method according to a second embodiment of the invention;
Fig. 5 schematically shows a block diagram of a data clustering device according to an embodiment of the invention;
Fig. 6 shows a schematic structural diagram of a computer system suitable for implementing the electronic equipment of an embodiment of the invention.
Detailed description of embodiments
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth here; rather, these embodiments are provided so that the invention will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the invention. However, those skilled in the art will appreciate that the technical scheme of the invention may be practised without one or more of the specific details, or with other methods, components, devices, steps and so on. In other cases, well-known methods, devices, implementations or operations are not shown or described in detail, to avoid obscuring aspects of the invention.
The block diagrams shown in the drawings are only functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are only illustrative; they need not include all of the contents and operations/steps, nor must the steps be performed in the order described. For example, some operations/steps may be decomposed, and some may be merged or partly merged, so the actual order of execution may change according to the actual situation.
Fig. 1 schematically shows a flow chart of a data clustering method according to a first embodiment of the invention.
Referring to Fig. 1, the data clustering method according to the first embodiment of the invention includes:
Step S10: obtain a data set to be clustered.
Step S12: calculate the distance between each data item in the data set and the cluster centres of the existing categories.
According to an exemplary embodiment of the invention, step S12 specifically includes: for the data items in the data set, calculating in turn the distance between each data item and the cluster centres of the existing categories.
In an embodiment of the invention, the step of calculating in turn the distance between each data item and the cluster centres of the existing categories includes: calculating the distance between a first data item in the data set and the cluster centres of the existing categories; assigning the first data item to a category according to those distances and updating the cluster centre of that category; and then calculating the distance between a second data item in the data set and the cluster centres of the existing categories.
In this embodiment, after a data item in the data set has been assigned to a category, the cluster centre must be updated to ensure the accuracy of subsequent clustering operations. Only after the cluster centre has been updated are the distances between the other data items in the data set and the cluster centres of the existing categories calculated.
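The source does not say how a cluster centre is updated. A minimal sketch, assuming the centre is the arithmetic mean of the members of the category, is the running-mean update below; the function name `update_center` and the member counter are hypothetical.

```python
from typing import List, Tuple


def update_center(center: List[float], count: int, point: List[float]) -> Tuple[List[float], int]:
    """Fold one newly assigned point into a running-mean cluster centre.

    Assumption: the centre is the mean of the category members.
    `count` is the number of members before `point` is added.
    """
    new_count = count + 1
    # Move each coordinate towards the new point by 1/new_count of the gap.
    new_center = [c + (x - c) / new_count for c, x in zip(center, point)]
    return new_center, new_count


center, count = update_center([0.0, 0.0], 1, [2.0, 4.0])
print(center, count)  # -> [1.0, 2.0] 2
```

Updating in this incremental form keeps the cost of each update proportional to the number of dimensions, so the members of a category do not have to be revisited.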
When calculating the distances between any data item in the data set and the cluster centres of the existing categories, the invention proposes the following two modes for determining the category to which the data item should be assigned:
Mode one:
While calculating the distances between any data item in the data set and the cluster centres of the existing categories, if the distance between the data item and the cluster centre of any existing category is determined to be less than or equal to the distance threshold, the calculation of the distances between the data item and the cluster centres of the remaining categories is stopped.
In mode one, the distances between a data item and the cluster centres of the existing categories may be calculated one after another or simultaneously. Either way, as soon as the distance between the data item and the cluster centre of some category is found to be less than or equal to the distance threshold, the calculation of the distances to the cluster centres of the remaining categories stops. This mode shortens the calculation time and improves clustering efficiency.
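Mode one can be sketched as a first-fit search. The example assumes Euclidean distance and sequential evaluation, neither of which the source prescribes, and the name `first_fit` is hypothetical.

```python
import math
from typing import List, Optional, Sequence


def first_fit(point: Sequence[float], centers: List[Sequence[float]], threshold: float) -> Optional[int]:
    """Return the index of the first centre within `threshold`, or None if there is none."""
    for index, center in enumerate(centers):
        if math.dist(point, center) <= threshold:
            return index  # stop here: the remaining centres are never examined
    return None


centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
print(first_fit((0.8, 0.0), centers, threshold=1.0))  # -> 0
```

In this run the point is accepted by centre 0 at distance 0.8 even though centre 1 is closer, which illustrates the trade of accuracy for speed that mode one makes.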
Mode two:
In some embodiments of the invention, based on the foregoing scheme, calculating the distance between any data item in the data set and the cluster centres of the existing categories includes: calculating the distances between the data item and the cluster centres of all existing categories to obtain the shortest distance between the data item and those cluster centres; judging whether the shortest distance is less than or equal to the distance threshold; and, if it is, taking the category corresponding to the shortest distance as the category to which the data item is assigned.
In mode two, the distances between a data item and the cluster centres of the existing categories may likewise be calculated one after another or simultaneously. Either way, the distances between the data item and the cluster centres of all categories must be calculated and the minimum distance selected; if the minimum distance is less than or equal to the distance threshold, the data item is assigned to the category corresponding to the minimum distance. This mode finds a more suitable category and improves clustering accuracy.
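Mode two can be sketched as a best-fit search over all centres. As before, Euclidean distance is an assumption and the name `best_fit` is hypothetical.

```python
import math
from typing import List, Optional, Sequence


def best_fit(point: Sequence[float], centers: List[Sequence[float]], threshold: float) -> Optional[int]:
    """Return the index of the nearest centre if it lies within `threshold`, else None."""
    if not centers:
        return None
    distances = [math.dist(point, center) for center in centers]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    return nearest if distances[nearest] <= threshold else None


centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
print(best_fit((0.8, 0.0), centers, threshold=1.0))  # -> 1
```

Every centre is examined, so the cost per data item grows with the number of categories, but the point is placed with the closest centre (distance 0.2 here) rather than the first acceptable one.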
With continued reference to Fig. 1, the data clustering method further includes:
Step S14: if the distance between any data item in the data set and the cluster centre of any existing category is less than or equal to the distance threshold, assign that data item to that category.
Step S16: if the distances between any data item in the data set and the cluster centres of all existing categories are all greater than the distance threshold, create a new category and assign that data item to the new category.
In addition, the data clustering method may further include: before calculating the distance between any data item in the data set and the cluster centres of the existing categories, if no category exists yet, creating a new category and assigning the data item to the new category.
It should be noted that after the data item has been assigned to the existing category or to the new category, the cluster centre of that category is updated.
For the above distance threshold, embodiments of the invention provide the following calculation method:
In an embodiment of the invention, part of the data in the data set is selected as a sample data set; the pairwise distances between the data items in the sample data set are calculated to obtain the distance set of the sample data set; the probability density function of the two-dimensional Gaussian mixture distribution that the distance set obeys is determined from the distance set; and the distance threshold is determined from the probability density function.
In some embodiments of the invention, based on the foregoing scheme, the step of determining the distance threshold from the probability density function includes: calculating the minimum point of the probability density function, where the abscissa of any point of the probability density function represents a distance value; and taking the abscissa of the minimum point as the distance threshold.
To further explain the technical scheme of the embodiments of the invention, it is described in detail below with reference to Fig. 2 to Fig. 4:
The basic idea of the embodiments of the invention is as follows: in a data set, the distances between data in the same category (the intra-class distances) are small and the similarity is high, while the distances between data in different categories (the inter-class distances) are large and the similarity is low. A distance threshold Th can therefore be set, and if the distance between two data points is less than the threshold, the two data points are placed in the same category.
This is expressed by the formula:
$$\forall S_j \in G:\ d(S_i, S_j) < Th \;\Rightarrow\; S_i \in G$$
That is, if the distance d(S_i, S_j) between S_i and every data point S_j in category G is less than a certain distance threshold Th, then the data point S_i is assigned to category G.
Normally the distance threshold could be tuned by trial and error while observing the clustering results. However, the data characteristics differ greatly between clustering scenarios, so a distance threshold that suits one scenario does not suit others, and when the data volume is large, tuning the distance threshold manually is time-consuming. To solve this problem, choosing the distance threshold (similarity threshold) automatically becomes the key to the algorithm.
Repeated experiments show that the set of distances between the data points of a data set obeys a two-dimensional Gaussian mixture distribution, and that the abscissa of the minimum point of the corresponding probability density function curve (the abscissa represents the distance value) can be used as the distance threshold for clustering. Put simply, the set of intra-class distances obeys a normal distribution with a small mean and a large variance, the set of inter-class distances obeys a normal distribution with a large mean and a small variance, and the intersection of the two normal probability density curves is the minimum point.
When the amount of input data is large, calculating all pairwise distances requires too much computation for ordinary computing resources, so part of the input data is selected at random and the pairwise distances among the selected data are used as the distance set.
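The sampling step can be sketched as follows, assuming Euclidean distance and treating the distance as symmetric so that each unordered pair is computed once; the function name, the `seed` parameter and the sample size used in the example are hypothetical.

```python
import itertools
import math
import random
from typing import List, Sequence


def sample_pairwise_distances(data: Sequence[Sequence[float]], m: int, seed: int = 0) -> List[float]:
    """Draw m items at random and return the distance of every unordered pair among them."""
    rng = random.Random(seed)
    sample = rng.sample(list(data), min(m, len(data)))
    return [math.dist(a, b) for a, b in itertools.combinations(sample, 2)]


data = [(float(i), 0.0) for i in range(100)]
distances = sample_pairwise_distances(data, m=10)
print(len(distances))  # -> 45
```

A sample of m items yields m(m-1)/2 distances, so the cost depends on the sample size rather than on the size n of the full data set.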
In an exemplary embodiment of the invention, let the input data set be data[n]. The specific steps for choosing the distance threshold intelligently are as follows:
(1) Randomly select a subset of samples and denote its size by m.
(2) Calculate the pairwise distances in the sample data set; let d[i][j] be the distance between data[i] and data[j], so that the set of pairwise sample distances is d[m][m].
(3) Take d[m][m] as input and perform parameter estimation in MATLAB to fit automatically the probability density function f of the two-dimensional Gaussian mixture distribution that the distance set obeys.
(4) Take the abscissa of the minimum point of the probability density function f as the distance threshold Th.
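Steps (3) and (4) can be sketched in Python instead of MATLAB. The sketch rests on several assumptions: the mixture is read as two normal components over scalar distance values, the parameters are estimated with a hand-written expectation-maximisation loop, and the minimum is located by a grid search between the two fitted means. All function names are hypothetical, and the input distances are synthetic.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values: List[float], iterations: int = 100) -> List[Tuple[float, float, float]]:
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns two (weight, mean, std) tuples ordered by mean.
    Requires at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    # Start from the lower and upper halves of the sorted distances.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (len(ordered) - half)]
    spread = (ordered[-1] - ordered[0]) / 4.0 or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each distance.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and standard deviation.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)
    return sorted(zip(weights, means, stds), key=lambda c: c[1])


def choose_threshold(values: List[float], grid: int = 1000) -> float:
    """Return the distance at which the fitted density is lowest between the two means."""
    (w1, m1, s1), (w2, m2, s2) = fit_two_gaussians(values)

    def density(x: float) -> float:
        return w1 * normal_pdf(x, m1, s1) + w2 * normal_pdf(x, m2, s2)

    candidates = [m1 + (m2 - m1) * i / grid for i in range(grid + 1)]
    return min(candidates, key=density)


# Synthetic distances: one group near 1.0 and one group near 5.0.
rng = random.Random(0)
distances = [rng.gauss(1.0, 0.3) for _ in range(300)] + [rng.gauss(5.0, 0.5) for _ in range(300)]
threshold = choose_threshold(distances)
print(round(threshold, 2))
```

If the fitted density has no interior dip between the two means, the grid search returns one of the means, so a real implementation would probably need to check that the mixture is actually bimodal before using the result as Th.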
As shown in Fig. 2 for a data acquisition system, the distance of element between any two in set of computations makees these distances For a distance set.The distance set obeys two-dimentional Gaussian mixtures, corresponding probability density function curve such as Fig. 3 institutes Show, the abscissa of the minimum point of curve can be used as distance threshold.
After distance threshold is selected, just data set can be clustered, the distance between data point is less than distance The element of threshold value is classified as a class.Cluster just can be completed because the clustering algorithm in the embodiment of the present invention need to only scan pass evidence, Therefore can extend for streaming cluster, idiographic flow is as shown in figure 4, comprise the following steps:
Step S402: initialise the current category set G to empty and denote the distance threshold by Th. It should be noted that after initialisation the category set G may contain no categories, or several categories may be preset.
Step S404: scan the elements of the data set one by one.
Step S406: obtain an element of the data set.
Step S408: if all elements of the data set have been scanned, the algorithm ends; otherwise go to step S410.
Step S410: calculate the distance between the current element and the cluster centre of each category in the current category set G; denote the smallest of these distances by d and the corresponding category by g.
Step S412: judge whether d≤Th holds and G is not empty; if so, go to step S414; otherwise go to step S416.
Step S414: assign the current element to category g and update the cluster centre of category g. Then return to step S406.
Step S416: create a new category h, assign the current element to category h, update the cluster centre of category h, and add the new category h to G. Then return to step S406.
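The flow of steps S402 to S416 can be sketched as a small streaming class. The sketch assumes Euclidean distance and a running-mean cluster centre, neither of which is fixed by the source; the class name `StreamingClusterer` and its attributes are hypothetical.

```python
import math
from typing import List, Sequence


class StreamingClusterer:
    """Single-pass clustering: each element is seen once and assigned immediately."""

    def __init__(self, threshold: float):
        self.threshold = threshold             # Th
        self.centers: List[List[float]] = []   # category set G, initially empty (S402)
        self.counts: List[int] = []            # members per category

    def add(self, point: Sequence[float]) -> int:
        """Assign one element and return the index of its category."""
        if self.centers:
            # S410: distance to every centre, keep the smallest (d) and its category (g).
            distances = [math.dist(point, c) for c in self.centers]
            g = min(range(len(distances)), key=distances.__getitem__)
            if distances[g] <= self.threshold:
                # S414: join category g and update its centre (running mean, an assumption).
                self.counts[g] += 1
                n = self.counts[g]
                self.centers[g] = [c + (x - c) / n for c, x in zip(self.centers[g], point)]
                return g
        # S416: create a new category h whose centre is the element itself.
        self.centers.append([float(x) for x in point])
        self.counts.append(1)
        return len(self.centers) - 1


clusterer = StreamingClusterer(threshold=1.5)
stream = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.5, 0.5), (10.5, 9.5)]
labels = [clusterer.add(p) for p in stream]
print(labels)  # -> [0, 0, 1, 0, 1]
```

Because `add` keeps only the centres and member counts, elements can be fed in as they arrive, which is what makes the flow usable for real-time data.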
It should be noted that in the flow shown in Fig. 4, the distances between the current element and the cluster centres of all categories in the category set G are calculated and the minimum distance is determined; if the minimum distance is less than or equal to the distance threshold Th, the current element is assigned to category g.
In other embodiments of the invention, when calculating the distances between the current element and the cluster centres of the categories in the category set G, the first category found whose distance is less than or equal to the distance threshold Th may instead be taken as the category to which the current element is assigned, which shortens the calculation time.
Based on the embodiment shown in Fig. 4, if the input data set contains n records, the time complexity of the algorithm of the embodiment shown in Fig. 4 is O(n*logn), where n is the number of records and logn is, in typical cases, the number of categories after clustering. The time complexity of the k-means algorithm is O(k*n*t), where k is the number of clusters (the number of categories after clustering), n is the number of records and t is the number of iterations. It can be seen that the time complexity of the technical scheme of the embodiments of the invention is much lower than that of the k-means algorithm, which is a clear advantage with large data volumes.
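As a rough illustration of the two bounds quoted above, the following worked arithmetic counts distance evaluations under purely hypothetical values that do not come from the source: n = 10^6 records, a base-2 logarithm, and k = 20 clusters with t = 50 iterations for k-means.

```latex
\begin{align*}
\text{single pass:}\quad & n \cdot \log_2 n \approx 10^{6} \times 20 = 2 \times 10^{7} \\
\text{k-means:}\quad & k \cdot n \cdot t = 20 \times 10^{6} \times 50 = 10^{9}
\end{align*}
```

With the number of categories equal in both cases, the ratio of the two counts is simply the number of k-means iterations t, here a factor of 50.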
In summary, the data clustering scheme of the embodiments of the present invention has the following advantages:
(1) The data clustering algorithm proposed by the embodiments of the present invention requires neither a preset number of clusters nor manually chosen initial cluster centers; instead, it determines the number of clusters automatically by analyzing the distribution of the data. Because no initial cluster centers have to be given, the adverse effect that poorly chosen initial cluster centers have on the final clustering result is avoided.
(2) Compared with other clustering algorithms, this algorithm usually generates more categories when clustering completes, because some categories contain very few data points. In most scenarios these categories can be regarded as outlier categories, which yields automatic outlier detection, while only the categories containing more data points are treated as categories of practical significance.
(3) Once the distance threshold has been determined, the algorithm needs no iteration: a single scan of the data completes the clustering, and the time complexity falls from the O(k*n*t) of the k-means algorithm to O(n*logn), an advantage that is pronounced in large-data-volume scenarios. In addition, because the algorithm scans the data only once, it can be extended to streaming clustering, that is, the clustering of real-time data, which widens its range of application.
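Advantages (2) and (3) can be illustrated together with a minimal streaming sketch. Everything named here (StreamingClusterer, add, outlier_candidates, min_size) is hypothetical. The sketch assumes Euclidean distance, assumes that each cluster center is the running mean of its members, and uses the nearest-center rule described in one of the embodiments.

```python
import math

class StreamingClusterer:
    """One-pass threshold clustering; each center is the running mean of its members."""

    def __init__(self, th):
        self.th = th
        self.centers = []  # one center (list of floats) per category
        self.counts = []   # number of points assigned to each category

    def add(self, point):
        """Assign one incoming point and return the index of its category."""
        best, best_dist = None, math.inf
        for i, center in enumerate(self.centers):
            d = math.dist(point, center)
            if d < best_dist:
                best, best_dist = i, d
        if best is None or best_dist > self.th:
            # No category yet, or every center is farther than th: open a new one.
            self.centers.append([float(x) for x in point])
            self.counts.append(1)
            return len(self.centers) - 1
        # Close enough: join the category and move its center toward the point.
        self.counts[best] += 1
        m = self.counts[best]
        self.centers[best] = [c + (x - c) / m for c, x in zip(self.centers[best], point)]
        return best

    def outlier_candidates(self, min_size):
        """Indices of categories with fewer than min_size points."""
        return [i for i, m in enumerate(self.counts) if m < min_size]


clusterer = StreamingClusterer(th=1.0)
for p in [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (0.1, 0.3), (9.0, 0.0)]:
    clusterer.add(p)  # each point is seen exactly once, as in a real-time stream
print(clusterer.counts, clusterer.outlier_candidates(min_size=2))
```

The cut-off min_size is a choice left to the application; the text only says that categories with very few points can be treated as outliers in most scenarios.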
Fig. 5 schematically shows a block diagram of a data clustering device according to an embodiment of the present invention.
Referring to Fig. 5, a data clustering device 500 according to an embodiment of the present invention includes an acquiring unit 502, a computing unit 504, and a processing unit 506.
Specifically, the acquiring unit 502 is configured to obtain a data set to be clustered. The computing unit 504 is configured to calculate the distance between each data item in the data set and the cluster centers of the existing categories. The processing unit 506 is configured to assign any data item in the data set to an existing category when the distance between that data item and the cluster center of that category is less than or equal to a distance threshold, and, when the distances between the data item and the cluster centers of all existing categories are all greater than the distance threshold, to create a new category and assign the data item to the new category.
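The division of work among the three units can be sketched as a single function. This is an assumption-laden illustration rather than the device itself: the name cluster_dataset is hypothetical, Euclidean distance is assumed, and, because no updating unit is involved at this point, each cluster center simply stays at the data item that founded its category.

```python
import math

def cluster_dataset(data_set, th):
    """Label each point with a category index; centers stay fixed once created."""
    centers, labels = [], []
    for point in data_set:                                  # data supplied by the acquiring unit
        distances = [math.dist(point, c) for c in centers]  # role of the computing unit
        within = [i for i, d in enumerate(distances) if d <= th]
        if within:
            labels.append(within[0])         # processing unit: join an existing category
        else:
            centers.append(point)            # processing unit: create a new category
            labels.append(len(centers) - 1)
    return labels, centers


labels, centers = cluster_dataset([(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1)], th=1.0)
print(labels, centers)
```

When no category exists yet, the list of distances is empty, so the first data item always opens a new category.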
In some embodiments of the present invention, based on the foregoing scheme, the processing unit 506 is further configured to: before the computing unit 504 calculates the distance between any data item in the data set and the cluster centers of the existing categories, if no category exists yet, create a new category and assign the data item to the new category.
In some embodiments of the present invention, based on the foregoing scheme, the device further includes an updating unit (not shown in Fig. 5), configured to update the cluster center of the existing category or of the new category after the data item has been assigned to it.
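This passage does not state how the updated center is computed. If one assumes that the cluster center is the arithmetic mean of the category's members, the update can be done without revisiting earlier data. The symbols below are introduced only for illustration: c_old is the center before the update, m is the number of members before the new data item x joins, and c_new is the updated center.

```latex
c_{\text{new}} = \frac{m\,c_{\text{old}} + x}{m + 1} = c_{\text{old}} + \frac{x - c_{\text{old}}}{m + 1}
```

For a newly created category m = 0, so the formula gives c_new = x: the first data item is its own center.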
In some embodiments of the present invention, based on the foregoing scheme, the computing unit 504 is configured to: for each data item in the data set, calculate in turn the distance between that data item and the cluster centers of the existing categories.
In some embodiments of the present invention, based on the foregoing scheme, the computing unit 504 is configured to: calculate the distance between a first data item in the data set and the cluster centers of the existing categories; and, after the first data item has been classified according to that distance and the cluster center of the resulting category has been updated, calculate the distance between a second data item in the data set and the cluster centers of the existing categories.
In some embodiments of the present invention, based on the foregoing scheme, the computing unit 504 is further configured to: while calculating the distances between any data item in the data set and the cluster centers of the existing categories, if the distance between the data item and the cluster center of any existing category is determined to be less than or equal to the distance threshold, stop calculating the distances between the data item and the cluster centers of the other existing categories.
In some embodiments of the present invention, based on the foregoing scheme, the computing unit 504 is configured to: calculate the distances between any data item in the data set and the cluster centers of all existing categories to obtain the shortest distance between the data item and those cluster centers; judge whether the shortest distance is less than or equal to the distance threshold; and, if it is, take the category corresponding to the shortest distance as the category to which the data item is assigned.
In some embodiments of the present invention, based on the foregoing scheme, the device further includes a selecting unit, configured to select part of the data from the data set as a sample data set. The computing unit 504 is further configured to calculate the distance between every two data items in the sample data set to obtain a distance set of the sample data set. The processing unit 506 is further configured to determine, according to the distance set, the probability density function of the two-dimensional Gaussian mixture distribution that the distance set obeys, and to determine the distance threshold according to the probability density function.
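The sampling and pairwise-distance steps can be sketched as below. The function name sample_distance_set, the parameters sample_size and seed, uniform random sampling, and Euclidean distance are all assumptions made for illustration; fitting the mixture distribution to the resulting distances is not shown.

```python
import itertools
import math
import random

def sample_distance_set(data_set, sample_size, seed=0):
    """Draw a random sample and return the distances between every pair in it."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sample = rng.sample(data_set, min(sample_size, len(data_set)))
    return [math.dist(a, b) for a, b in itertools.combinations(sample, 2)]


data_set = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (4.0, 4.0), (4.1, 3.9), (3.9, 4.2)]
distances = sample_distance_set(data_set, sample_size=6)
print(len(distances))  # 6 points give 6 * 5 / 2 = 15 pairwise distances
```

The number of pairs grows quadratically with the sample size, which is one practical reason to compute the distance set on a sample rather than on the whole data set.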
In some embodiments of the present invention, based on the foregoing scheme, the processing unit 506 is configured to: calculate the minimum point of the probability density function, where the abscissa of any point on the probability density function represents a distance value; and take the abscissa of the minimum point as the distance threshold.
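One possible way to locate the minimum point is a simple grid search, sketched below. Several assumptions are made: the mixture is read as two Gaussian components over scalar distance values, its parameters are taken as already fitted, the minimum of interest is the valley between the two component means, and all function names and parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, means, stds):
    """Density of a one-dimensional Gaussian mixture at distance x."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

def distance_threshold(weights, means, stds, steps=10_000):
    """Grid-search the abscissa of the density minimum between the component means."""
    lo, hi = min(means), max(means)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda x: mixture_pdf(x, weights, means, stds))


# Hypothetical fitted parameters: short within-category and long between-category distances.
th = distance_threshold(weights=[0.5, 0.5], means=[1.0, 5.0], stds=[0.5, 0.5])
print(round(th, 2))
```

For these symmetric example parameters the valley lies midway between the two means, at a distance of 3.0; with unequal weights or spreads it shifts toward the narrower or lighter component.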
Reference is now made to Fig. 6, which shows a schematic structural diagram of a computer system 600 suitable for implementing an electronic device of an embodiment of the present invention. The computer system 600 of the electronic device shown in Fig. 6 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for system operation. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present invention include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-described functions defined in the system of the present application are performed.
It should be noted that the computer-readable medium shown in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that computer-readable medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented in software or in hardware, and the described units may also be provided in a processor. In some cases, the names of these units do not constitute a limitation on the units themselves.
In another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the data clustering method described in the above embodiments.
For example, the electronic device may implement the steps shown in Fig. 1: step S10, obtaining a data set to be clustered; step S12, calculating the distance between each data item in the data set and the cluster centers of the existing categories; step S14, if the distance between any data item in the data set and the cluster center of any existing category is less than or equal to a distance threshold, assigning the data item to that category; step S16, if the distances between any data item in the data set and the cluster centers of all existing categories are all greater than the distance threshold, creating a new category and assigning the data item to the new category. As another example, the electronic device may implement the steps shown in Fig. 4.
It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more of the modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of a single module or unit described above may be further divided so as to be embodied by multiple modules or units.
From the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a touch terminal, or a network device) to execute the method according to the embodiments of the present invention.
Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. The present application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present invention. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present invention are indicated by the following claims.
It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (12)

1. A data clustering method, characterized by comprising:
obtaining a data set to be clustered;
calculating the distance between each data item in the data set and the cluster centers of the existing categories;
if the distance between any data item in the data set and the cluster center of any existing category is less than or equal to a distance threshold, assigning the data item to that category;
if the distances between any data item in the data set and the cluster centers of all existing categories are all greater than the distance threshold, creating a new category and assigning the data item to the new category.
2. The data clustering method according to claim 1, characterized by further comprising:
before calculating the distance between any data item in the data set and the cluster centers of the existing categories, if no category exists yet, creating a new category and assigning the data item to the new category.
3. The data clustering method according to claim 1, characterized by further comprising:
after the data item has been assigned to the existing category or to the new category, updating the cluster center of the existing category or of the new category.
4. The data clustering method according to claim 1, characterized in that the step of calculating the distance between each data item in the data set and the cluster centers of the existing categories comprises:
for each data item in the data set, calculating in turn the distance between that data item and the cluster centers of the existing categories.
5. The data clustering method according to claim 4, characterized in that the step of calculating in turn the distance between each data item and the cluster centers of the existing categories comprises:
calculating the distance between a first data item in the data set and the cluster centers of the existing categories;
classifying the first data item according to the distance between the first data item and the cluster centers of the existing categories, and, after updating the cluster center of the resulting category, calculating the distance between a second data item in the data set and the cluster centers of the existing categories.
6. The data clustering method according to claim 1, characterized by further comprising:
while calculating the distances between any data item in the data set and the cluster centers of the existing categories, if the distance between the data item and the cluster center of any existing category is determined to be less than or equal to the distance threshold, stopping the calculation of the distances between the data item and the cluster centers of the other existing categories.
7. The data clustering method according to claim 1, characterized in that calculating the distance between any data item in the data set and the cluster centers of the existing categories comprises:
calculating the distances between any data item in the data set and the cluster centers of all existing categories to obtain the shortest distance between the data item and the cluster centers of all the categories;
judging whether the shortest distance is less than or equal to the distance threshold;
if the shortest distance is less than or equal to the distance threshold, taking the category corresponding to the shortest distance as the category to which the data item is assigned.
8. The data clustering method according to any one of claims 1 to 7, characterized by further comprising:
selecting part of the data from the data set as a sample data set;
calculating the distance between every two data items in the sample data set to obtain a distance set of the sample data set;
determining, according to the distance set, the probability density function of the two-dimensional Gaussian mixture distribution that the distance set obeys;
determining the distance threshold according to the probability density function.
9. The data clustering method according to claim 8, characterized in that the step of determining the distance threshold according to the probability density function comprises:
calculating the minimum point of the probability density function, where the abscissa of any point on the probability density function represents a distance value;
taking the abscissa of the minimum point as the distance threshold.
10. A data clustering device, characterized by comprising:
an acquiring unit, configured to obtain a data set to be clustered;
a computing unit, configured to calculate the distance between each data item in the data set and the cluster centers of the existing categories;
a processing unit, configured to assign any data item in the data set to an existing category when the distance between the data item and the cluster center of that category is less than or equal to a distance threshold, and, when the distances between the data item and the cluster centers of all existing categories are all greater than the distance threshold, to create a new category and assign the data item to the new category.
11. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the data clustering method according to any one of claims 1 to 9.
12. An electronic device, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data clustering method according to any one of claims 1 to 9.
CN201710400066.3A 2017-05-31 2017-05-31 Data clustering method, device, computer-readable medium and electronic equipment Pending CN107067045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710400066.3A CN107067045A (en) 2017-05-31 2017-05-31 Data clustering method, device, computer-readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710400066.3A CN107067045A (en) 2017-05-31 2017-05-31 Data clustering method, device, computer-readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN107067045A true CN107067045A (en) 2017-08-18

Family

ID=59615433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710400066.3A Pending CN107067045A (en) 2017-05-31 2017-05-31 Data clustering method, device, computer-readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN107067045A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537380A (en) * 2014-12-30 2015-04-22 小米科技有限责任公司 Clustering method and device
CN105913001A (en) * 2016-04-06 2016-08-31 南京邮电大学盐城大数据研究院有限公司 On-line type multi-face image processing method based on clustering
CN106446148A (en) * 2016-09-21 2017-02-22 中国运载火箭技术研究院 Cluster-based text duplicate checking method

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729489A (en) * 2017-10-17 2018-02-23 北京京东尚科信息技术有限公司 Advertisement text recognition methods and device
CN107798354A (en) * 2017-11-16 2018-03-13 腾讯科技(深圳)有限公司 A kind of picture clustering method, device and storage device based on facial image
CN107798354B (en) * 2017-11-16 2022-11-01 腾讯科技(深圳)有限公司 Image clustering method and device based on face image and storage equipment
WO2019119635A1 (en) * 2017-12-18 2019-06-27 平安科技(深圳)有限公司 Seed user development method, electronic device and computer-readable storage medium
CN108460397A (en) * 2017-12-26 2018-08-28 东软集团股份有限公司 Analysis method, device, storage medium and the electronic equipment of equipment fault type
CN108229419B (en) * 2018-01-22 2022-03-04 百度在线网络技术(北京)有限公司 Method and apparatus for clustering images
CN108229419A (en) * 2018-01-22 2018-06-29 百度在线网络技术(北京)有限公司 For clustering the method and apparatus of image
CN110503117A (en) * 2018-05-16 2019-11-26 北京京东尚科信息技术有限公司 The method and apparatus of data clusters
CN109508087A (en) * 2018-09-25 2019-03-22 易念科技(深圳)有限公司 Brain line signal recognition method and terminal device
CN109324264B (en) * 2018-10-24 2023-07-18 中国电力科学研究院有限公司 Identification method and device for abnormal value of power distribution network line impedance data
CN109324264A (en) * 2018-10-24 2019-02-12 中国电力科学研究院有限公司 A kind of discrimination method and device of distribution network line impedance data exceptional value
CN110378843A (en) * 2018-11-13 2019-10-25 北京京东尚科信息技术有限公司 Data filtering methods and device
CN109934302A (en) * 2019-03-23 2019-06-25 大国创新智能科技(东莞)有限公司 New category recognition methods and robot system based on fuzzy theory and deep learning
CN110414569A (en) * 2019-07-03 2019-11-05 北京小米智能科技有限公司 Cluster realizing method and device
US11501099B2 (en) 2019-07-03 2022-11-15 Beijing Xiaomi Intelligent Technology Co., Ltd. Clustering method and device
CN112580677A (en) * 2019-09-29 2021-03-30 北京地平线机器人技术研发有限公司 Point cloud data point classification method and device
CN110930541A (en) * 2019-11-04 2020-03-27 洛阳中科晶上智能装备科技有限公司 Method for analyzing working condition state of agricultural machine by using GPS information
CN111158883B (en) * 2019-12-31 2023-11-28 青岛海尔科技有限公司 Method, device and computer for classifying tasks of operating system
CN111158883A (en) * 2019-12-31 2020-05-15 青岛海尔科技有限公司 Method and device for operating system task classification and computer
CN111428035A (en) * 2020-03-23 2020-07-17 北京明略软件系统有限公司 Entity clustering method and device
CN111507400A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Application classification method and device, electronic equipment and storage medium
CN111507400B (en) * 2020-04-16 2023-10-31 腾讯科技(深圳)有限公司 Application classification method, device, electronic equipment and storage medium
CN112215287A (en) * 2020-10-13 2021-01-12 中国光大银行股份有限公司 Distance-based multi-section clustering method and device, storage medium and electronic device
CN112215287B (en) * 2020-10-13 2024-04-12 中国光大银行股份有限公司 Multi-section clustering method and device based on distance, storage medium and electronic device
CN112381163A (en) * 2020-11-20 2021-02-19 平安科技(深圳)有限公司 User clustering method, device and equipment
CN112381163B (en) * 2020-11-20 2023-07-25 平安科技(深圳)有限公司 User clustering method, device and equipment
CN112598041A (en) * 2020-12-17 2021-04-02 武汉大学 Power distribution network cloud platform data verification method based on K-MEANS algorithm
CN112766403A (en) * 2020-12-29 2021-05-07 广东电网有限责任公司电力科学研究院 Incremental clustering method and device based on information gain weight
CN113282446A (en) * 2021-04-07 2021-08-20 广州汇通国信科技有限公司 Log data collection method and system based on multi-granularity filtering
CN115376705A (en) * 2022-10-24 2022-11-22 北京京东拓先科技有限公司 Method and device for analyzing drug specification
CN116204800A (en) * 2022-11-30 2023-06-02 北京码牛科技股份有限公司 Controllable clustering method, system, terminal and storage medium for position point division

Similar Documents

Publication Publication Date Title
CN107067045A (en) Data clustering method, device, computer-readable medium and electronic equipment
Lv et al. Generative adversarial networks for parallel transportation systems
CN110349147B (en) Model training method, fundus macular region lesion recognition method, device and equipment
Li et al. IBEA-SVM: an indicator-based evolutionary algorithm based on pre-selection with classification guided by SVM
CN110008259A (en) The method and terminal device of visualized data analysis
CN104933428B (en) A kind of face identification method and device based on tensor description
CN110457403A (en) The construction method of figure network decision system, method and knowledge mapping
CN106033425A (en) A data processing device and a data processing method
CN109344969B (en) Neural network system, training method thereof, and computer-readable medium
CN110084175A (en) A kind of object detection method, object detecting device and electronic equipment
CN109710760A (en) Clustering method, device, medium and the electronic equipment of short text
Cuevas et al. Evolutionary computation techniques: a comparative perspective
CN114925938B (en) Electric energy meter running state prediction method and device based on self-adaptive SVM model
CN110110663A (en) A kind of age recognition methods and system based on face character
CN107688783A (en) 3D rendering detection method, device, electronic equipment and computer-readable medium
CN110796482A (en) Financial data classification method and device for machine learning model and electronic equipment
CN112420125A (en) Molecular attribute prediction method and device, intelligent equipment and terminal
CN110097098A (en) Data classification method and device, medium and electronic equipment based on base classifier
CN110188763A (en) A kind of image significance detection method based on improvement graph model
KR102039244B1 (en) Data clustering method using firefly algorithm and the system thereof
CN113962401A (en) Federal learning system, and feature selection method and device in federal learning system
Hamedmoghadam et al. A global optimization approach based on opinion formation in complex networks
CN108509876A (en) For the object detecting method of video, device, equipment, storage medium and program
Xiang et al. Optical flow estimation using spatial-channel combinational attention-based pyramid networks
CN110071845A (en) The method and device that a kind of pair of unknown applications are classified

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170818