CN106874367A - Sampling-based distributed clustering method based on a public opinion platform - Google Patents

Sampling-based distributed clustering method based on a public opinion platform

Info

Publication number
CN106874367A
CN106874367A CN201611260883.5A CN201611260883A CN106874367A CN 106874367 A CN106874367 A CN 106874367A CN 201611260883 A CN201611260883 A CN 201611260883A CN 106874367 A CN106874367 A CN 106874367A
Authority
CN
China
Prior art keywords
data
cluster
sampling
clustering method
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611260883.5A
Other languages
Chinese (zh)
Inventor
汪伟亚
许恺
黄强松
陈辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu One Hundred Information Service Co Ltd
Original Assignee
Jiangsu One Hundred Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu One Hundred Information Service Co Ltd
Priority to CN201611260883.5A
Publication of CN106874367A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention provides a sampling-based distributed clustering method based on a public opinion platform, comprising the following steps. Step 1: obtain the data to be clustered and shard it to obtain multiple shards. Step 2: sample each shard using the Map function of MapReduce. Step 3: gather the sampled data and cluster the gathered sample in the Reduce phase of the MapReduce framework. Step 4: repeat step 2 and step 3 in sequence for a total of r sampling rounds, record the clustering result of each round's sample as a base clustering, and obtain the vector Π = {π_1, π_2, ..., π_r}, where r is an integer greater than or equal to 2 and π_i is the base clustering of round i, for integer i with 1 ≤ i ≤ r. Step 5: use the MapReduce framework again to integrate the base clusterings into the final clustering result. The method can effectively improve the efficiency of clustering massive data and increase data diversity while reducing the data scale.

Description

Sampling-based distributed clustering method based on a public opinion platform
Technical field
The invention belongs to the field of data mining and machine learning, and in particular relates to a sampling-based distributed clustering method based on a public opinion platform.
Background
Data clustering operates on the similarity between data sample points, so that sample points with high similarity fall into the same cluster while sample points with low similarity are kept apart. Clustering has long been one of the important methods in data mining and machine learning. However, with the development of the Internet, and especially the explosive growth of user-generated content brought by Web 2.0, data volume has become the bottleneck of traditional clustering methods. This is especially true of text data in application fields such as news recommendation, machine translation, literature retrieval, intelligence analysis, and public opinion monitoring, which is high-dimensional and sparse. How to improve the efficiency of clustering algorithms, particularly for high-dimensional sparse data, has become an urgent problem in Internet big data mining.
Therefore, it is necessary to provide a sampling-based distributed clustering method based on a public opinion platform that can improve the efficiency of clustering high-dimensional sparse data.
Summary of the invention
The object of the invention is to provide a sampling-based distributed clustering method based on a public opinion platform that can improve the efficiency of clustering high-dimensional sparse data.
The technical scheme is as follows. A sampling-based distributed clustering method based on a public opinion platform comprises the following steps. Step 1: obtain the data to be clustered and shard it to obtain multiple shards. Step 2: sample each shard using the Map function of MapReduce. Step 3: gather the sampled data and cluster the gathered sample in the Reduce phase of the MapReduce framework. Step 4: repeat step 2 and step 3 in sequence for a total of r sampling rounds, record the clustering result of each round's sample as a base clustering, and obtain the vector Π = {π_1, π_2, ..., π_r}, where r is an integer greater than or equal to 2 and π_i is the base clustering of round i, for integer i with 1 ≤ i ≤ r. Step 5: use the MapReduce framework again to integrate the base clusterings into the final clustering result.
Preferably, in step 1, the data to be clustered is split horizontally, the integrity of every record is preserved during splitting, and the resulting shards are stored in a distributed file system.
Preferably, the sampling in step 2 meets at least the following requirements: the sampling technique itself is sufficiently simple, sampling is performed on local data, and the sampling result has a degree of randomness.
Preferably, in step 3, the specific sampling round is used as the key and the sampled data as the value; the pairs are gathered by the shuffle into a single Reduce function of MapReduce, and the sampled data is clustered in that Reduce function.
Preferably, step 5 comprises the following steps: randomly select a certain number of base clusterings as centroids; use the Map function to compute the distance between each of the other base clusterings and the centroids, and assign each base clustering to the cluster of its nearest centroid; update the centroids of the clusters in the Reduce function; repeat this process until the centroids of the clusters no longer change.
Preferably, let z_k denote the centroid of the k-th cluster in the base clustering vector Π, described as an rk-dimensional vector:
$$z_k = (z_k^1, \ldots, z_k^i, \ldots, z_k^r),$$
Wherein,
Preferably, let the vector Π be described as an rk-dimensional vector x_l; then the cosine distance between x_l and z_k is:
$$\mathrm{CosDist}(x_l, z_k) = \sum_{i=1}^{r} w_i \left(1 - \cos(x_l^i, z_k^i)\right)$$
where w_i denotes the weight of the i-th base clustering, taken as 1/r when no prior knowledge is available.
Preferably, the centroid z_k is updated using the following formula:
WhereinIt is the constant vector on Π,
Represent the quantity of example in i-th k-th cluster of base cluster;
For ‖z_k^i‖_2 and ‖T^(i)‖_2: given a d-dimensional real vector y, ‖y‖_p denotes the L_p norm of y, i.e.,
$$\lVert y \rVert_p = \Bigl(\sum_{j=1}^{d} \lvert y_j \rvert^p\Bigr)^{1/p}$$
The technical scheme provided by the invention has the following beneficial effects:
The sampling-based distributed clustering method based on a public opinion platform uses sampling to reduce the data scale, improves the diversity of the base clustering results through multiple sampling rounds, and then defines a cosine distance to integrate the base clustering results into the final clustering result, so it can effectively improve the efficiency of clustering massive data.
Moreover, introducing sampling increases data diversity while reducing the data scale, and a two-stage clustering process is then designed on a distributed computing framework, providing an effective way to improve the quality and efficiency of clustering for public opinion analysis in Internet big data.
Brief description of the drawings
Fig. 1 is a flow chart of the sampling-based distributed clustering method based on a public opinion platform provided by an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Unless the context clearly indicates otherwise, the elements and components in the invention may be present in either singular or plural form, and the invention is not limited in this respect. Although the steps in the invention are labeled in sequence, the labels do not limit the order of the steps; unless the order of steps is expressly stated or the execution of a step depends on other steps, the relative order of the steps is adjustable. The term "and/or" as used herein refers to and covers any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, the sampling-based distributed clustering method 100 based on a public opinion platform provided by an embodiment of the invention comprises the following steps:
S1. Obtain the data to be clustered and shard it to obtain multiple shards.
In step S1, the data to be clustered is split horizontally into several shards, and the integrity of every record (for example, a news text) must be preserved during splitting. The resulting shards are stored in a distributed file system such as HDFS; the shard size is determined by the chosen distributed file system, for example 64 MB per shard in HDFS. By accessing the distributed file system, the compute nodes can share the shards, and localizing computation effectively reduces I/O cost.
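The splitting rule of step S1 (horizontal split, no record ever cut in two) can be illustrated with a minimal in-memory sketch. The function name `shard_records` and the byte budget `max_bytes` are assumptions made for illustration; a real deployment would rely on the distributed file system's own block handling rather than on code like this.

```python
from typing import Iterable, List


def shard_records(records: Iterable[str], max_bytes: int) -> List[List[str]]:
    """Split records horizontally into shards without ever cutting a record."""
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for record in records:
        size = len(record.encode("utf-8"))
        # Close the current shard when this record would exceed the budget.
        # A record larger than the budget still gets a shard of its own.
        if current and current_size + size > max_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += size
    if current:
        shards.append(current)
    return shards


docs = ["alpha news", "beta post", "gamma text", "delta"]
shards = shard_records(docs, max_bytes=20)
```

Every record lands in exactly one shard and the original order is kept, so concatenating the shards reproduces the input.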
S2. Sample each shard using the Map function of MapReduce.
Specifically, in step S2, sampling is performed on each shard. For reasons of divide-and-conquer and efficiency, the sampling technique should meet at least the following requirements: 1. the sampling technique itself must be simple enough, otherwise it becomes a new bottleneck; 2. sampling can be performed on local data without relying on a global view; 3. the sampling result should have a degree of randomness.
Any sampling method that meets the above requirements can be used in the invention, and no specific restriction is placed on it. In step S2, the sampling operation is implemented by the Map function of the MapReduce framework; this is referred to as the first-stage Map process.
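One sampling scheme that appears to satisfy all three requirements is independent Bernoulli sampling per record. The text does not prescribe a particular scheme, so this choice, the function name `map_sample`, and the parameters `rate` and `seed` are assumptions. The sketch emits (round, record) pairs so that the round can later serve as the key.

```python
import random
from typing import Iterator, List, Tuple


def map_sample(shard: List[str], round_id: int, shard_id: int,
               rate: float, seed: int = 0) -> Iterator[Tuple[int, str]]:
    """Map-side Bernoulli sampling: emit (round_id, record) for each kept record."""
    # Only the local shard is read, so no global view is needed.
    # Seeding per (seed, round, shard) keeps runs reproducible while giving
    # every round and every shard its own random stream.
    rng = random.Random(f"{seed}-{round_id}-{shard_id}")
    for record in shard:
        if rng.random() < rate:
            yield round_id, record


shard = ["u1", "u2", "u3", "u4", "u5"]
kept = list(map_sample(shard, round_id=0, shard_id=0, rate=0.4))
```

Each record costs one random draw, which keeps the sampler cheap enough not to become a bottleneck of its own.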
S3. Gather the sampled data, and cluster the gathered sample in the Reduce phase of the MapReduce framework.
Specifically, in step S3, for the sampling result of each round, the specific sampling round is used as the key and the sampled data as the value; the pairs are gathered by the shuffle into a single Reduce function of MapReduce, and the sampled data is clustered in that Reduce function. This is referred to as the first-stage Reduce process.
The specific clustering method includes, but is not limited to, K-means, spectral clustering, and hierarchical clustering; the invention places no restriction on it.
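The first-stage Reduce process can be sketched in plain Python: group the Map output by round, then cluster each round's sample. K-means with squared Euclidean distance is used here only because it is one of the methods the text lists; the helper names `shuffle`, `sq_dist`, and `reduce_cluster` are hypothetical, and a production job would use the framework's own shuffle and a library clusterer.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def shuffle(pairs: Iterable[Tuple[Hashable, Vector]]) -> Dict[Hashable, List[Vector]]:
    """Group Map output by key, as the shuffle does between Map and Reduce."""
    groups: Dict[Hashable, List[Vector]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reduce_cluster(points: Sequence[Vector], k: int,
                   seed: int = 0, max_iters: int = 50) -> List[int]:
    """Reduce side: plain K-means over one round's sample, one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(list(points), k)
    labels: List[int] = []
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are as well
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels
```

Calling `reduce_cluster(groups[round_id], k)` once per key of `shuffle(...)` yields one base clustering per sampling round.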
S4. Repeat step S2 and step S3 in sequence for a total of r sampling rounds, record the clustering result of each round's sample as a base clustering, and obtain the vector Π = {π_1, π_2, ..., π_r}, where r is an integer greater than or equal to 2 and π_i is the base clustering of round i, for integer i with 1 ≤ i ≤ r.
S5. Use the MapReduce framework again to integrate the base clusterings into the final clustering result.
In step S5, cluster ensemble is performed on the vector Π, and each base clustering is treated as a whole so that the distances between base clusterings can be computed.
Specifically, step S5 comprises the following steps:
Randomly select a certain number of base clusterings as centroids, use the Map function to compute the distance between each of the other base clusterings and the centroids, assign each base clustering to the cluster of its nearest centroid, and update the centroids of the clusters in the Reduce function. These are referred to as the second-stage Map process and the second-stage Reduce process.
Repeat this process until the centroids of the clusters no longer change.
In this embodiment, the distance computation and the cluster assignment of the base clusterings are performed in the second-stage Map process, and the centroid update is performed in the second-stage Reduce process.
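The alternation between the second-stage Map process (distance and assignment) and the second-stage Reduce process (centroid update) can be sketched as below. The distance and the update rule are passed in as callables because the update formula is not recoverable from this text; the names `map_assign`, `reduce_recenter`, and `integrate`, and the squared-difference and mean stand-ins used in the demo, are assumptions and not the method's own definitions.

```python
import random
from collections import defaultdict
from typing import Callable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]


def map_assign(points: Sequence[Point], centroids: Sequence[Point],
               distance: Callable[[Point, Point], float]) -> Iterator[Tuple[int, Point]]:
    """Map phase: emit (index of nearest centroid, point)."""
    for p in points:
        yield min(range(len(centroids)), key=lambda c: distance(p, centroids[c])), p


def reduce_recenter(pairs: Sequence[Tuple[int, Point]], centroids: Sequence[Point],
                    recenter: Callable[[List[Point]], Point]) -> List[Point]:
    """Reduce phase: recompute each centroid from the points that share its key."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    # A centroid that received no points is kept unchanged.
    return [recenter(groups[c]) if c in groups else centroids[c]
            for c in range(len(centroids))]


def integrate(points: Sequence[Point], k: int,
              distance: Callable[[Point, Point], float],
              recenter: Callable[[List[Point]], Point],
              seed: int = 0, max_iters: int = 100):
    """Alternate Map and Reduce until the centroids stop changing."""
    centroids = random.Random(seed).sample(list(points), k)
    for _ in range(max_iters):
        pairs = list(map_assign(points, centroids, distance))
        new_centroids = reduce_recenter(pairs, centroids, recenter)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [key for key, _ in map_assign(points, centroids, distance)]
    return centroids, labels


# Demo with stand-in distance (squared difference) and update (component mean).
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
centroids, labels = integrate(
    data, k=2,
    distance=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)),
    recenter=lambda ps: tuple(sum(col) / len(ps) for col in zip(*ps)),
)
```

Keeping the two phases as separate functions mirrors the split described in the text: assignment is embarrassingly parallel over the data, while the update only needs the points grouped under each key.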
Moreover, in the second-stage Map process, the sampling-based distributed clustering method 100 defines a cosine distance for the computation:
Let z_k denote the centroid of the k-th cluster in the base clustering vector Π, described as an rk-dimensional vector:
$$z_k = (z_k^1, \ldots, z_k^i, \ldots, z_k^r),$$
Wherein,
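Moreover, let the vector Π be described as an rk-dimensional vector x_l; then the cosine distance between x_l and z_k is:
$$\mathrm{CosDist}(x_l, z_k) = \sum_{i=1}^{r} w_i \left(1 - \cos(x_l^i, z_k^i)\right)$$
where w_i denotes the weight of the i-th base clustering, taken as 1/r when no prior knowledge is available.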
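The weighted cosine distance, written in claim 7 as CosDist(x_l, z_k) = Σ_i w_i (1 − cos(x_l^i, z_k^i)), can be computed block by block. The sketch below assumes that x_l and z_k are each stored as r blocks, one per base clustering, as the superscript i suggests. Treating the cosine of a zero block as 0 is also an assumption, since the text does not cover that case; `cosine` and `cos_dist` are hypothetical names.

```python
import math
from typing import Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an empty block
    return dot / (norm_u * norm_v)


def cos_dist(x_blocks: Sequence[Sequence[float]],
             z_blocks: Sequence[Sequence[float]],
             weights: Optional[Sequence[float]] = None) -> float:
    """Sum over the r blocks of w_i * (1 - cos(x^i, z^i))."""
    r = len(x_blocks)
    if weights is None:
        weights = [1.0 / r] * r  # no prior knowledge: every base clustering weighs 1/r
    return sum(w * (1.0 - cosine(x, z))
               for w, x, z in zip(weights, x_blocks, z_blocks))


x = [[1.0, 0.0], [0.0, 1.0]]          # r = 2 blocks, one-hot membership in each
z_same = [[1.0, 0.0], [0.0, 1.0]]
z_half = [[0.0, 1.0], [0.0, 1.0]]     # disagrees with x in the first block only
```

With equal weights, an instance that agrees with a centroid in every block has distance 0, and each fully disagreeing block adds 1/r.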
In the second-stage Reduce process, after all base clusterings have been assigned to a cluster in step S5, the centroids of all clusters are updated; the centroid z_k is updated using the following formula:
Wherein,It is the constant vector on Π,
Represent the quantity of example in i-th k-th cluster of base cluster;
For ‖z_k^i‖_2 and ‖T^(i)‖_2: given a d-dimensional real vector y, ‖y‖_p denotes the L_p norm of y, i.e.,
$$\lVert y \rVert_p = \Bigl(\sum_{j=1}^{d} \lvert y_j \rvert^p\Bigr)^{1/p}$$
In the centroid update of the second-stage Reduce process, once the centroid of every cluster has been updated, the second-stage Map process is repeated: the distance between each base clustering and the new centroids is recomputed and the base clusterings are reassigned, until the cluster centroids no longer change.
Taking group discovery among microblog users as an example, a specific embodiment of the invention and its steps are described in detail. User data includes related attributes such as age, gender, hobbies, following, followers, and reposts, and can be defined as a vector. Group discovery clusters users according to their vectors, so that users with higher similarity are clustered into one group. Because the number of microblog users is extremely large, the distributed clustering method proposed by the invention is applicable.
The massive user data is first stored in HDFS shards of 64 MB each. During storage, the data of each user is never split; that is, the data of a single user is stored in the same shard.
Users are randomly selected on each shard using the Map function of the distributed in-memory computing framework Spark (one implementation of the MapReduce framework).
The users selected on all shards are gathered on one node for clustering, which is based on the user vectors and implemented with the Reduce function of Spark. The clustering method can be the commonly used K-means; Spark's machine learning library MLlib provides an implementation of the K-means algorithm on Spark. The above process is repeated n times, and each run yields one base clustering.
After n base clusterings are obtained, they are integrated. The integration process iterates between "cluster assignment" and "centroid update" and is likewise implemented with the distributed in-memory computing framework Spark. Data in Spark is abstracted and described by Resilient Distributed Datasets (RDDs), one of Spark's most important core technologies; all operations are performed on RDDs.
Here, an instance is x_l in formula (3), a high-dimensional sparse vector. To save storage, it is modeled in the RDD as two arrays that store the non-zero indices and their values respectively. Since the number of instances is very large, the matrix formed by the vectors is first sharded horizontally, and each shard is imported as one RDD. Each RDD corresponds to one Map task, which computes the distance between the instances in the RDD and all centroids and assigns (classify) each instance.
Note that in computing the cosine distance it is not necessary to process every component of each vector; only the inner product over the indices shared by the two vectors needs to be computed, which further reduces the computational complexity. In the key-value pairs output by Map, the key is the cluster number and the value is the specific instance. In the Reduce phase, the centroid is updated (recenter) from all values with the same key, that is, from the instances belonging to the same cluster. The new centroids (including the k centroids randomly selected at initialization) are sent to all RDDs using Spark's broadcast mechanism, and the next iteration begins, until the centroids or the clusters no longer change.
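The two-array representation and the shared-index inner product can be sketched as follows. It assumes that each vector keeps its non-zero indices in ascending order, which the text does not state, and the names `sparse_dot`, `sparse_norm`, and `sparse_cosine` are hypothetical.

```python
import math
from typing import List, Tuple

# (sorted non-zero indices, matching values)
SparseVec = Tuple[List[int], List[float]]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Inner product that touches only indices present in both vectors."""
    (idx_a, val_a), (idx_b, val_b) = a, b
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total


def sparse_norm(a: SparseVec) -> float:
    """Euclidean norm computed from the stored non-zero values only."""
    return math.sqrt(sum(v * v for v in a[1]))


def sparse_cosine(a: SparseVec, b: SparseVec) -> float:
    """Cosine similarity; 0.0 if either vector has no non-zero entries."""
    denom = sparse_norm(a) * sparse_norm(b)
    return sparse_dot(a, b) / denom if denom else 0.0


a = ([0, 3, 7], [1.0, 2.0, 2.0])   # norm 3
b = ([3, 5], [3.0, 4.0])           # norm 5
```

The merge-style walk costs time proportional to the number of stored entries rather than to the full dimensionality, which is where the saving described above comes from.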
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
The sampling-based distributed clustering method 100 based on a public opinion platform uses sampling to reduce the data scale, improves the diversity of the base clustering results through multiple sampling rounds, and then defines a cosine distance to integrate the base clustering results into the final clustering result, so it can effectively improve the efficiency of clustering massive data.
Moreover, introducing sampling increases data diversity while reducing the data scale, and a two-stage clustering process is then designed on a distributed computing framework, providing an effective way to improve the quality and efficiency of clustering for public opinion analysis in Internet big data.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be included in the invention. Any reference sign in a claim should not be construed as limiting the claim concerned.
Moreover, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical scheme. The specification is written this way only for clarity; those skilled in the art should take the specification as a whole, and the technical schemes in the embodiments can be appropriately combined to form other embodiments that those skilled in the art can understand.

Claims (8)

1. A sampling-based distributed clustering method based on a public opinion platform, characterized by comprising the following steps:
Step 1: obtain the data to be clustered and shard it to obtain multiple shards;
Step 2: sample each shard using the Map function of MapReduce;
Step 3: gather the sampled data, and cluster the gathered sample in the Reduce phase of the MapReduce framework;
Step 4: repeat step 2 and step 3 in sequence for a total of r sampling rounds, record the clustering result of each round's sample as a base clustering, and obtain the vector Π = {π_1, π_2, ..., π_r}, where r is an integer greater than or equal to 2 and π_i is the base clustering of round i, for integer i with 1 ≤ i ≤ r;
Step 5: use the MapReduce framework again to integrate the base clusterings into the final clustering result.
2. The sampling-based distributed clustering method based on a public opinion platform according to claim 1, characterized in that: in step 1, the data to be clustered is split horizontally, the integrity of every record is preserved during splitting, and the resulting shards are stored in a distributed file system.
3. The sampling-based distributed clustering method based on a public opinion platform according to claim 1, characterized in that: the sampling in step 2 meets at least the following requirements: the sampling technique itself is sufficiently simple, sampling is performed on local data, and the sampling result has a degree of randomness.
4. The sampling-based distributed clustering method based on a public opinion platform according to claim 1, characterized in that: in step 3, the specific sampling round is used as the key and the sampled data as the value; the pairs are gathered by the shuffle into a single Reduce function of MapReduce, and the sampled data is clustered in that Reduce function.
5. The sampling-based distributed clustering method based on a public opinion platform according to claim 1, characterized in that step 5 comprises the following steps:
randomly select a certain number of base clusterings as centroids, use the Map function to compute the distance between each of the other base clusterings and the centroids, assign each base clustering to the cluster of its nearest centroid, and update the centroids of the clusters in the Reduce function;
repeat this process until the centroids of the clusters no longer change.
6. The sampling-based distributed clustering method based on a public opinion platform according to claim 5, characterized in that: let z_k denote the centroid of the k-th cluster in the base clustering vector Π, described as an rk-dimensional vector:
$$z_k = (z_k^1, \ldots, z_k^i, \ldots, z_k^r),$$
Wherein,
7. The sampling-based distributed clustering method based on a public opinion platform according to claim 6, characterized in that: let the vector Π be described as an rk-dimensional vector x_l; then the cosine distance between x_l and z_k is:
$$\mathrm{CosDist}(x_l, z_k) = \sum_{i=1}^{r} w_i \left(1 - \cos(x_l^i, z_k^i)\right)$$
where w_i denotes the weight of the i-th base clustering, taken as 1/r when no prior knowledge is available.
8. The sampling-based distributed clustering method based on a public opinion platform according to claim 5, characterized in that: the centroid z_k is updated using the following formula:
$${z_k^i}' = \lVert z_k^i \rVert_2 - \lVert T^{(i)} \rVert_2$$
WhereinIt is the constant vector on Π,
Represent the quantity of example in i-th k-th cluster of base cluster;
For ‖z_k^i‖_2 and ‖T^(i)‖_2: given a d-dimensional real vector y, ‖y‖_p denotes the L_p norm of y, i.e.,
$$\lVert y \rVert_p = \Bigl(\sum_{j=1}^{d} \lvert y_j \rvert^p\Bigr)^{1/p}$$
CN201611260883.5A 2016-12-30 2016-12-30 Sampling-based distributed clustering method based on a public opinion platform Pending CN106874367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611260883.5A CN106874367A (en) 2016-12-30 2016-12-30 Sampling-based distributed clustering method based on a public opinion platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611260883.5A CN106874367A (en) 2016-12-30 2016-12-30 Sampling-based distributed clustering method based on a public opinion platform

Publications (1)

Publication Number Publication Date
CN106874367A true CN106874367A (en) 2017-06-20

Family

ID=59164125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611260883.5A Pending CN106874367A (en) 2016-12-30 2016-12-30 Sampling-based distributed clustering method based on a public opinion platform

Country Status (1)

Country Link
CN (1) CN106874367A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107516110A (en) * 2017-08-22 2017-12-26 华南理工大学 A kind of medical question and answer Semantic Clustering method based on integrated convolutional encoding
CN110008336A (en) * 2019-01-14 2019-07-12 阿里巴巴集团控股有限公司 A kind of public sentiment method for early warning and system based on deep learning
CN110704515A (en) * 2019-12-11 2020-01-17 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model
CN110909817A (en) * 2019-11-29 2020-03-24 深圳市商汤科技有限公司 Distributed clustering method and system, processor, electronic device and storage medium
WO2021249502A1 (en) * 2020-06-12 2021-12-16 支付宝(杭州)信息技术有限公司 Method and apparatus for clustering privacy data of multiple parties

Citations (5)

Publication number Priority date Publication date Assignee Title
CN104156463A (en) * 2014-08-21 2014-11-19 南京信息工程大学 Big-data clustering ensemble method based on MapReduce
CN104809242A (en) * 2015-05-15 2015-07-29 成都睿峰科技有限公司 Distributed-structure-based big data clustering method and device
CN104820708A (en) * 2015-05-15 2015-08-05 成都睿峰科技有限公司 Cloud computing platform based big data clustering method and device
CN106095791A (en) * 2016-01-31 2016-11-09 长源动力(山东)智能科技有限公司 A kind of abstract sample information searching system based on context and abstract sample characteristics method for expressing thereof
US20160350146A1 (en) * 2015-05-29 2016-12-01 Cisco Technology, Inc. Optimized hadoop task scheduler in an optimally placed virtualized hadoop cluster using network cost optimizations

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN104156463A (en) * 2014-08-21 2014-11-19 南京信息工程大学 Big-data clustering ensemble method based on MapReduce
CN104809242A (en) * 2015-05-15 2015-07-29 成都睿峰科技有限公司 Distributed-structure-based big data clustering method and device
CN104820708A (en) * 2015-05-15 2015-08-05 成都睿峰科技有限公司 Cloud computing platform based big data clustering method and device
US20160350146A1 (en) * 2015-05-29 2016-12-01 Cisco Technology, Inc. Optimized hadoop task scheduler in an optimally placed virtualized hadoop cluster using network cost optimizations
CN106095791A (en) * 2016-01-31 2016-11-09 长源动力(山东)智能科技有限公司 A kind of abstract sample information searching system based on context and abstract sample characteristics method for expressing thereof

Non-Patent Citations (1)

Title
Cai Jingying (蔡静颖) et al.: "Fuzzy Clustering Algorithms and Applications" (《模糊聚类算法及应用》), 31 August 2015 *

Cited By (8)

Publication number Priority date Publication date Assignee Title
CN107516110A (en) * 2017-08-22 2017-12-26 华南理工大学 A kind of medical question and answer Semantic Clustering method based on integrated convolutional encoding
CN107516110B (en) * 2017-08-22 2020-02-18 华南理工大学 Medical question-answer semantic clustering method based on integrated convolutional coding
CN110008336A (en) * 2019-01-14 2019-07-12 阿里巴巴集团控股有限公司 A kind of public sentiment method for early warning and system based on deep learning
CN110008336B (en) * 2019-01-14 2023-04-07 创新先进技术有限公司 Public opinion early warning method and system based on deep learning
CN110909817A (en) * 2019-11-29 2020-03-24 深圳市商汤科技有限公司 Distributed clustering method and system, processor, electronic device and storage medium
CN110909817B (en) * 2019-11-29 2022-11-11 深圳市商汤科技有限公司 Distributed clustering method and system, processor, electronic device and storage medium
CN110704515A (en) * 2019-12-11 2020-01-17 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model
WO2021249502A1 (en) * 2020-06-12 2021-12-16 支付宝(杭州)信息技术有限公司 Method and apparatus for clustering privacy data of multiple parties

Similar Documents

Publication Publication Date Title
CN106874367A (en) Sampling-based distributed clustering method based on a public opinion platform
Lebedev et al. Fast convnets using group-wise brain damage
CN102129451B (en) Method for clustering data in image retrieval system
US20150242497A1 (en) User interest recommending method and apparatus
CN104200369B (en) Method and device for determining commodity distribution range
CN104615779B (en) A kind of Web text individuations recommend method
CN103902704B (en) Towards the multidimensional inverted index and quick retrieval of large-scale image visual signature
CN106210044B (en) A kind of any active ues recognition methods based on access behavior
CN109885640B (en) Multi-keyword ciphertext sorting and searching method based on alpha-fork index tree
Chen et al. Coarsening the granularity: Towards structurally sparse lottery tickets
CN107180079B (en) Image retrieval method based on convolutional neural network and tree and hash combined index
CN111125469B (en) User clustering method and device of social network and computer equipment
CN106570173B (en) Spark-based high-dimensional sparse text data clustering method
CN111177410A (en) Knowledge graph storage and similarity retrieval method based on evolution R-tree
CN109840551B (en) Method for optimizing random forest parameters for machine learning model training
CN109033453A (en) A kind of film recommended method and system based on RBM Yu the cluster of difference secret protection
CN106503146B (en) The feature selection approach of computer version
CN106897276A (en) A kind of internet data clustering method and system
CN105354343B (en) User characteristics method for digging based on remote dialogue
CN110825738A (en) Data storage and query method and device based on distributed RDF
CN104978395B (en) Visual dictionary building and application method and device
Adinugroho et al. Optimizing K-means text document clustering using latent semantic indexing and pillar algorithm
CN103761298B (en) Distributed-architecture-based entity matching method
US20150012563A1 (en) Data mining using associative matrices
CN107704872A (en) A kind of K means based on relatively most discrete dimension segmentation cluster initial center choosing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170620