CN113010759B - Cluster set processing method and device, computer readable medium and electronic equipment - Google Patents
- Publication number
- CN113010759B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9035—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
Embodiments of the application provide a cluster set processing method and apparatus, a computer-readable medium, and an electronic device. The method comprises: acquiring a plurality of cluster sets to be processed, each cluster set containing multiple pieces of cluster information; determining, from the plurality of cluster sets and according to the latest update time of each cluster set, a first cluster set whose latest update time is after a predetermined time point; selecting, from the cluster sets other than the first cluster set and according to the cluster information content each of them contains, the cluster sets whose cluster information content is greater than or equal to a set threshold, to obtain a screened second cluster set; and generating a processing result for the plurality of cluster sets according to the first cluster set and the second cluster set. The technical solution of the embodiments can effectively address the algorithm performance degradation caused by the growing number of cluster sets in an incremental clustering algorithm.
Description
Technical Field
The present application relates to the field of computers and communications technologies, and in particular, to a method and apparatus for processing a cluster set, a computer readable medium, and an electronic device.
Background
Incremental clustering starts from an existing batch of clustering results: when data is newly added, only the new data is clustered and the existing clustering results are modified incrementally, so the whole data set does not need to be re-clustered after each addition. While incremental clustering handles newly added data in this way, it also faces a data redundancy problem. As the amount of data grows, computational performance inevitably declines, and the related art offers no effective solution to this problem.
Disclosure of Invention
Embodiments of the application provide a cluster set processing method and apparatus, a computer-readable medium, and an electronic device which can, at least to some extent, effectively address the algorithm performance degradation caused by the growing number of cluster sets in an incremental clustering algorithm.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method for processing a cluster set, including: acquiring a plurality of cluster sets to be processed, each cluster set containing multiple pieces of cluster information; determining, from the plurality of cluster sets and according to the latest update time of each cluster set, a first cluster set whose latest update time is after a predetermined time point; selecting, from the cluster sets other than the first cluster set and according to the cluster information content each of them contains, the cluster sets whose cluster information content is greater than or equal to a set threshold, to obtain a screened second cluster set; and generating a processing result for the plurality of cluster sets according to the first cluster set and the second cluster set.
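The four steps of the method can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the names `ClusterSet`, `process_cluster_sets`, `cutoff`, and `min_items` are hypothetical, and the cluster information content is assumed here to be simply the number of items in a set.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ClusterSet:
    set_id: int              # identification number of the cluster set
    items: list              # pieces of cluster information (assumed representation)
    last_updated: datetime   # latest update time


def process_cluster_sets(cluster_sets, cutoff, min_items):
    """Split cluster sets into first (recent), second (retained), and remaining sets."""
    # First cluster sets: latest update time is after the predetermined time point.
    first = [c for c in cluster_sets if c.last_updated > cutoff]
    others = [c for c in cluster_sets if c.last_updated <= cutoff]
    # Second cluster sets: older sets whose content meets the threshold.
    # Assumption: content is measured by item count.
    second = [c for c in others if len(c.items) >= min_items]
    remaining = [c for c in others if len(c.items) < min_items]
    return first, second, remaining
```

Under these assumptions, the first and second groups together form the working set that later incremental clustering compares against, while the remaining sets are candidates for cleanup.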
According to an aspect of an embodiment of the present application, there is provided a cluster set processing apparatus, including: an acquisition unit configured to acquire a plurality of cluster sets to be processed, each cluster set containing multiple pieces of cluster information; a determining unit configured to determine, from the plurality of cluster sets and according to the latest update time of each cluster set, a first cluster set whose latest update time is after a predetermined time point; a screening unit configured to select, from the cluster sets other than the first cluster set and according to the cluster information content each of them contains, the cluster sets whose cluster information content is greater than or equal to a set threshold, to obtain a screened second cluster set; and a generating unit configured to generate a processing result for the plurality of cluster sets according to the first cluster set and the second cluster set.
In some embodiments of the application, based on the foregoing scheme, the screening unit comprises: a calculating subunit configured to calculate an information content score for each of the other cluster sets (the cluster sets other than the first cluster set) according to the cluster information content that the other cluster set contains; and a screening subunit configured to select, from the other cluster sets, the cluster sets whose information content score is greater than or equal to a predetermined score threshold, to obtain the screened second cluster set.
In some embodiments of the application, based on the foregoing scheme, the calculating subunit is configured to: acquire the content of a specific type of cluster information contained in each of the other cluster sets, and the ratio between that specific-type content and the total cluster information content of the set; and calculate the information content score of each of the other cluster sets according to its cluster information content, its specific-type cluster information content, and the ratio between the two.
In some embodiments of the application, based on the foregoing scheme, the calculating subunit is configured to: determine a target calculated value for each of the other cluster sets according to the cluster information content it contains; and calculate the information content score of each of the other cluster sets according to its target calculated value, its specific-type cluster information content, and the ratio between its specific-type cluster information content and its total cluster information content.
In some embodiments of the application, based on the foregoing scheme, the calculating subunit is configured to: determine the numerical interval into which each of the other cluster sets falls according to the cluster information content it contains; and determine the target calculated value of each of the other cluster sets according to that numerical interval and a correspondence between numerical intervals and calculated values.
In some embodiments of the application, based on the foregoing scheme, the calculating subunit is configured to: compute a weighted sum of the target calculated value of each of the other cluster sets, its specific-type cluster information content, and the ratio between its specific-type cluster information content and its total cluster information content, to obtain an operation result; and determine the information content score of the other cluster set according to the operation result.
In some embodiments of the application, based on the foregoing scheme, the calculating subunit is configured to: calculate the sum of the operation result and a predetermined constant value; and normalize the calculated sum to obtain a normalized result value, which is taken as the information content score of the other cluster set.
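The scoring chain described above (interval lookup, weighted sum, added constant, normalization) can be sketched as below. The text does not give the interval boundaries, the weights, the constant, or the normalization function, so every concrete value here is a placeholder, and the logistic function is only one possible choice of normalization. The names `INTERVALS`, `target_value`, and `content_score` are hypothetical.

```python
import math

# Hypothetical correspondence between item-count intervals [low, high)
# and calculated values; the real table is not given in the text.
INTERVALS = [(0, 10, 0.0), (10, 100, 1.0), (100, float("inf"), 2.0)]


def target_value(item_count):
    """Return the calculated value of the interval that contains item_count."""
    for low, high, value in INTERVALS:
        if low <= item_count < high:
            return value
    raise ValueError("item_count must be non-negative")


def content_score(item_count, special_count, weights=(1.0, 0.1, 2.0), bias=-2.0):
    """Weighted sum plus a constant, normalized into (0, 1).

    Assumptions: placeholder weights and bias; logistic normalization.
    """
    ratio = special_count / item_count if item_count else 0.0
    w_target, w_special, w_ratio = weights
    raw = (w_target * target_value(item_count)
           + w_special * special_count
           + w_ratio * ratio)
    return 1.0 / (1.0 + math.exp(-(raw + bias)))
```

A cluster set would then be kept in the second cluster set when `content_score(...)` is at least the predetermined score threshold.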
In some embodiments of the application, based on the foregoing scheme, the apparatus further comprises: a set acquisition unit configured to, if newly added information to be clustered exists, acquire from the processing result of the plurality of cluster sets a cluster set that matches the information to be clustered; and an adding unit configured to add the information to be clustered to the matched cluster set.
In some embodiments of the application, based on the foregoing scheme, the generating unit includes: an acquisition subunit configured to acquire identification numbers of the respective cluster sets; and the processing subunit is configured to take the identification numbers of the cluster sets, the first cluster set and the second cluster set as processing results for the plurality of cluster sets.
In some embodiments of the application, based on the foregoing, the apparatus further comprises: an identification number obtaining unit configured to obtain identification numbers of each cluster set from processing results of the plurality of cluster sets to be processed if a newly added cluster set exists, so as to obtain a plurality of identification numbers; an identification number generating unit configured to generate an identification number of the newly added cluster set according to the plurality of identification numbers.
In some embodiments of the present application, based on the foregoing scheme, the identification number generation unit is configured to: acquiring a maximum identification number from the plurality of identification numbers; and calculating the sum of the maximum identification number and a preset value, and taking the calculated sum value as the identification number of the newly added cluster set.
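The identification number rule amounts to "largest existing number plus a preset value". A minimal sketch, where the function name `next_cluster_id`, the default step of 1, and the behavior for an empty list are assumptions not stated in the text:

```python
def next_cluster_id(existing_ids, step=1):
    """Return the maximum existing identification number plus a preset step.

    Assumption: with no existing IDs, numbering starts at `step`.
    """
    return max(existing_ids, default=0) + step
```

Because the processing result keeps the identification numbers of all cluster sets, taking the maximum over that list plausibly avoids reusing the number of a set that has been cleaned out of the online working set.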
In some embodiments of the application, based on the foregoing scheme, the apparatus further comprises: a storage unit configured to store offline those of the other cluster sets that are not in the second cluster set.
According to an aspect of an embodiment of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method of processing a set of clusters as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: one or more processors; and a storage device for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the method for processing a cluster set as described in the above embodiments.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method of processing a cluster set provided in the various alternative embodiments described above.
In the technical solutions provided in some embodiments of the present application, a plurality of cluster sets to be processed are acquired; a first cluster set whose latest update time is after a predetermined time point is determined from the plurality of cluster sets according to the latest update time of each cluster set; cluster sets whose cluster information content is greater than or equal to a set threshold are selected from the cluster sets other than the first cluster set, according to the cluster information content they contain, to obtain a screened second cluster set; and a processing result for the plurality of cluster sets is generated according to the first cluster set and the second cluster set. The plurality of cluster sets are thus not processed from a single angle but by combining the latest update time with the cluster information content, so the processing result is more accurate: unimportant cluster sets are cleaned out, important cluster sets are retained, and the algorithm performance degradation caused by the growing number of cluster sets in an incremental clustering algorithm is effectively addressed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the application may be applied;
FIG. 2 illustrates a flow chart of a method of processing a cluster set according to one embodiment of the application;
FIG. 3 illustrates a flow chart of a method of processing a cluster set according to one embodiment of the application;
FIG. 4 shows a flow chart of a method of processing a cluster set according to one embodiment of the application;
FIG. 5 shows a flow chart of a method of processing a cluster set according to one embodiment of the application;
FIG. 6 shows a flow chart of a method of processing a cluster set according to one embodiment of the application;
FIG. 7 shows a flow chart of a method of processing a cluster set according to one embodiment of the application;
FIG. 8 illustrates a logic diagram of a method of processing a cluster set in accordance with one embodiment of the application;
FIG. 9 shows a block diagram of a processing apparatus of a cluster set, according to one embodiment of the application;
FIG. 10 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
It should be noted that the terms used in the description of the present application and the claims and the above-mentioned drawings are only used for describing the embodiments, and are not intended to limit the scope of the present application. It will be understood that the terms "comprises," "comprising," "includes," "including" and/or "having," when used herein, specify the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be further understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element without departing from the scope of the present application. Similarly, the second element may be referred to as a first element. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It should be noted that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and covers three cases; for example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of sensing, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields and includes both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. The key technologies of speech technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the most favored modes of human-computer interaction.
Natural language processing (NLP) is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Clustering is an important branch of natural language processing. In short, clustering is the process of dividing a collection of physical or abstract objects into multiple classes (or clusters) of similar objects, so that objects in the same class are highly similar and objects in different classes differ substantially. Clustering algorithms are mainly divided into hierarchical methods, partitioning methods, density-based methods, grid-based methods, and model-based methods.
A hierarchical method decomposes the data hierarchically and stops when a given condition is met. Hierarchical methods fall into two categories: bottom-up and top-down. A partitioning method divides the objects in a set into m subsets so that objects within a subset are highly similar while objects in different subsets are less similar. A density-based method defines a threshold and compares the density of the data with it: if the density is greater than the threshold, the data is placed in the similar class, and if the density is less than the threshold, the data stays in its original class. A grid-based method divides the data into a number of grids and then processes the data using a grid data structure. A model-based method starts from a given model and then finds the data that satisfies it.
When a clustering algorithm is executed on data, a batch of clustering results is obtained, where each clustering result is a class produced by clustering and the data within the same clustering result have similar characteristics. Once clustering results exist, new data often appears in the database. Re-clustering everything each time new data arrives consumes substantial resources and time. If a clustering algorithm can instead be run only on the new data and still produce the same result as re-clustering, resource waste is reduced and time is saved. This is incremental clustering.
In incremental clustering, when a batch of clustering results already exists and some data is newly added, only the newly added data is clustered and the existing clustering results are modified incrementally, so the whole data set does not need to be re-clustered. For example, the new data may be assigned to one of the existing clustering results or to a new clustering result.
Incremental clustering algorithms include variants of conventional clustering methods, such as density-based and grid-based incremental clustering algorithms. A density-based incremental clustering algorithm takes the maximum class density over all classes as the neighborhood radius for the newly added data, acquires the data within that neighborhood, and determines the class of the newly added data according to the class density of each class to which the neighboring data belongs. A grid-based incremental clustering algorithm starts from the grid structure already formed and, as data continues to arrive, dynamically and incrementally partitions the grid structure by dimension radius.
As incremental clustering algorithms are widely applied, they cluster newly added data but also face increasingly redundant clustering results. In an incremental clustering algorithm, whether newly added data belongs to an existing clustering result is judged against a set threshold; if the similarity is below the threshold, the data is added to a new clustering result. Consequently, the number of clustering results inevitably grows rapidly as data is added. However, many of these clustering results contain only a few elements and are not the core clustering results the algorithm needs. Because the complexity of a single computation is positively correlated with the number of clustering results, this growth inevitably reduces computational performance.
To address this problem, the related art generally screens out unusable clustering results offline at regular intervals and uploads the screened clustering results for online use, thereby reducing the number of clustering results. However, this approach has the following drawbacks:
(1) Blank time window. The related art screens the online clustering results offline, so online incremental data arriving within that time window cannot be processed in time.
(2) Excessive online latency. The related art screens the cluster data only periodically, so if the volume of newly added data suddenly surges, clustering results are not screened and cleaned in time and online computation time increases.
(3) Manual intervention. The related art requires the clustering results to be screened manually at regular intervals, which adds complexity to the overall incremental clustering flow. Because manual processing cannot guarantee consistent screening rules each time, it can increase the instability and irrecoverability of the clustering results.
In this regard, the embodiment of the present application provides a method for processing a cluster set. The cluster set targeted by this method may be a clustering result obtained by incremental clustering. In some embodiments, obtaining the cluster set by incremental clustering may specifically include the following steps:
acquiring incremental data relative to initial data;
calculating the similarity between the incremental data and each initial clustering set to obtain a plurality of similarities, wherein the initial clustering set is obtained by clustering the initial data;
if the maximum similarity in the plurality of similarities is greater than or equal to the preset similarity, adding the incremental data into an initial cluster set corresponding to the maximum similarity in the plurality of similarities;
If the maximum similarity in the plurality of similarities is smaller than the preset similarity, a new cluster set is created, and incremental data is added into the new cluster set.
In this embodiment, the initial data may be clustered using a K-means algorithm, which is relatively simple and fast. The data size of the initial data is generally large, and the K-means algorithm performs relatively well on such data. After the clustering is completed, cluster centers are obtained, together with the cluster sets formed by each cluster center and the other initial data; the data in the same cluster set are close to one another.
The incremental data is defined relative to the global data and can be obtained by comparison with the initial data. For each piece of incremental data, the similarity between it and each initial cluster set may be calculated. If the maximum of the calculated similarities is greater than or equal to the preset similarity, the incremental data is similar to the data of the initial cluster set corresponding to the maximum similarity, and the incremental data can be clustered into that initial cluster set. Otherwise, if the maximum similarity is smaller than the preset similarity, the similarity between the incremental data and the initial data is low, and the incremental data is added into a new cluster set instead of being clustered into an initial cluster set.
In some embodiments, calculating the similarity between the incremental data and each initial cluster set may specifically include calculating the distance between the incremental data and the cluster center of each initial cluster set: the greater the distance, the smaller the similarity, and the smaller the distance, the greater the similarity. The distance between the incremental data and the cluster center of each initial cluster set may be an absolute distance, a Euclidean distance, a Chebyshev distance, or the like. After the distances are calculated, if the minimum distance is smaller than or equal to the preset distance, the incremental data is relatively close to the data of the initial cluster set corresponding to the minimum distance, and the incremental data can be clustered into that initial cluster set. Conversely, if the minimum distance is greater than the preset distance, the similarity between the incremental data and the initial data is low; for the sake of clustering accuracy, the incremental data is not clustered into an initial cluster set, and instead a new cluster set is created and the incremental data is added into it.
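The distance-based assignment described above can be illustrated with a minimal Python sketch. The function name `assign_incremental`, the parameter `max_distance` (standing in for the preset distance) and the sample coordinates are assumptions made for illustration only; the sketch uses the Euclidean distance and, as a simplification, does not recompute a cluster center after a point is added.

```python
import math

def assign_incremental(point, centers, max_distance):
    """Assign one piece of incremental data to a cluster set.

    Returns the index of the cluster set that receives the point. When the
    minimum distance to every existing center exceeds max_distance, the point
    seeds a new cluster set and the index of that new set is returned.
    """
    # Euclidean distance from the point to each existing cluster center.
    distances = [math.dist(point, center) for center in centers]
    if distances:
        nearest = min(range(len(distances)), key=distances.__getitem__)
        if distances[nearest] <= max_distance:
            return nearest
    # No center is close enough: create a new cluster set (hypothetical
    # simplification: the point itself becomes the new center).
    centers.append(tuple(point))
    return len(centers) - 1

centers = [(0.0, 0.0), (10.0, 10.0)]
near = assign_incremental((0.5, 0.5), centers, max_distance=2.0)
far = assign_incremental((5.0, 5.0), centers, max_distance=2.0)
```

In this sketch the first point joins the existing set at index 0, while the second point is farther than `max_distance` from both centers and therefore opens a third set.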
After the incremental data is clustered to obtain a plurality of cluster sets, the method for processing cluster sets provided by the embodiment of the present application may be further adopted to process them. That is, the plurality of cluster sets to be processed are obtained; then, according to the latest update time of each cluster set, a first cluster set whose latest update time is after a preset time point is determined from the plurality of cluster sets; next, according to the content of the cluster information contained in the other cluster sets (those other than the first cluster set), the cluster sets whose cluster information content is greater than or equal to a set threshold are screened out to obtain a screened second cluster set; finally, a processing result for the plurality of cluster sets is generated according to the first cluster set and the second cluster set.
This technical scheme does not process the plurality of cluster sets from a single angle; instead, it combines the latest update time with the content of the cluster information contained. The processing result is therefore more accurate: unimportant cluster sets are cleaned and important cluster sets are retained, which effectively solves the problem of reduced algorithm performance caused by the growth of cluster sets in an incremental clustering algorithm.
It should be noted that, the above method for processing the cluster set may be applied to a processing scene of the cluster set obtained by performing incremental clustering on any data, for example, for an application scene of user data, the cluster set may be a user set formed by a plurality of similar users, and for an application scene of image data, the cluster set may also be an image set formed by a plurality of similar images.
For example, when the user data are the registered users of a social application, clustering the registered users yields a group of cluster sets, each of which is a user set formed by a plurality of similar registered users. When registered users are newly added, only the newly added registered users are clustered and the existing cluster sets are modified incrementally: each newly added registered user is classified either into one of the existing cluster sets or into a new cluster set.
However, as newly added registered users increase, the number of cluster sets increases. Since the single computation complexity of the incremental clustering algorithm is positively correlated with the quantity of the cluster sets, the increase of the cluster sets inevitably leads to the reduction of the computation performance of the incremental clustering on the newly added registered users. Therefore, the cluster set obtained by incremental clustering of the registered user can be processed through the processing method of the cluster set, and the processing result of the cluster set obtained by incremental clustering of the registered user is obtained, so that the unimportant cluster set obtained by incremental clustering of the registered user is cleaned, the important cluster set is reserved, and the problem of algorithm performance degradation caused by the increase of the cluster set in an incremental clustering algorithm is effectively solved.
It should be further noted that the above application scenario is merely an exemplary example, and does not constitute a limitation of the application scenario of the technical solution of the embodiment of the present application, and the technical solution of the embodiment of the present application may be applied to a processing scenario of a cluster set obtained by performing incremental clustering on any data.
Turning first to a system architecture to which the method for processing a cluster set according to an embodiment of the present application may be applied, fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution according to an embodiment of the present application may be applied. As shown in fig. 1, the system architecture 100 may include one or more of the terminal devices 101, 102, 103, a network 104, and a server 105.
The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired communication links, wireless communication links, or fiber optic cables. The terminal devices 101, 102, 103 are all user-facing terminal devices, and may specifically be smart devices such as smart phones, tablet computers, portable personal computers, desktop computers, smart speakers, smart watches, bracelets, smart televisions, and the like. The server 105 may be a stand-alone physical server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
The method for processing the cluster set provided by the embodiment of the present application is generally executed by the server 105, and correspondingly, the processing device of the cluster set is generally arranged in the server 105. However, those skilled in the art will readily understand that the method may also be performed by the terminal devices 101, 102, 103, and accordingly, the processing device of the cluster set may also be provided in the terminal devices 101, 102, 103, which is not particularly limited in this exemplary embodiment. For example, in an exemplary embodiment, a user may upload a plurality of cluster sets to be processed to the server 105 through the terminal devices 101, 102, 103; the server 105 processes the plurality of cluster sets through the method provided by the embodiment of the present application and sends the obtained processing result to the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative, and that any number of terminal devices, networks, and servers may be provided as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The implementation details of the technical scheme of the embodiment of the present application are described below.
Fig. 2 shows a flowchart of a method for processing a cluster set according to an embodiment of the present application; the method is described here using the example in which it is applied to the server 105. The method for processing a cluster set includes at least the following steps:
step S210, obtaining a plurality of cluster sets to be processed, wherein each cluster set contains a plurality of cluster information;
step S220, determining a first cluster set with the latest update time after a preset time point from a plurality of cluster sets according to the latest update time of each cluster set;
step S230, screening out a cluster set with the content of the cluster information greater than or equal to a set threshold according to the content of the cluster information contained in other cluster sets except the first cluster set in the plurality of cluster sets, and obtaining a screened second cluster set;
Step S240, according to the first cluster set and the second cluster set, processing results for a plurality of cluster sets are generated.
These steps are described in detail below.
In step S210, a plurality of cluster sets to be processed are acquired, and each cluster set includes a plurality of cluster information.
The clustering set is a clustering result obtained through a clustering algorithm, the clustering set comprises a plurality of pieces of clustering information, and the plurality of pieces of clustering information have great similarity. In this embodiment, the cluster set may be various types of sets, for example, the cluster set may be a user set formed by a plurality of similar users, or the cluster set may be an image set formed by a plurality of similar images.
When a batch of clustering results already exists and some data is newly added, incremental clustering means that only the newly added data is clustered and the existing clustering results are modified incrementally, without re-clustering the whole data set that includes the new data. With the wide application of incremental clustering algorithms, these algorithms cluster newly added data but also face increasingly redundant clustering results. Owing to the nature of the incremental clustering algorithm, the number of clustering results inevitably grows rapidly as newly added data increases, and this growth inevitably reduces computing performance.
Therefore, to solve the problem of reduced computing performance caused by excessive clustering results, in this embodiment a plurality of cluster sets to be processed may first be acquired and then processed. It is understood that the plurality of cluster sets to be processed are clustering results that have already been obtained by a clustering algorithm.
In step S220, a first cluster set whose latest update time is after a predetermined point in time is determined from among the plurality of cluster sets according to the latest update time of each cluster set.
It can be understood that when cluster information is newly added, the cluster set that matches the newly added cluster information updates the cluster information in the set by adding the newly added cluster information to it. That is, an update time is a time at which a cluster set updates the cluster information in the set, and the latest update time is the time at which the cluster set most recently did so.
For example, if cluster set A has been updated twice, first at 9:00 and then at 10:00, the latest update time of cluster set A is the second update time, 10:00; if cluster set B has been updated only once, at 11:00, the latest update time of cluster set B is 11:00.
In this embodiment, after the plurality of cluster sets to be processed are acquired, the processing may first determine, according to the latest update time of each cluster set, a first cluster set whose latest update time is after a predetermined time point. The reason is as follows. Although a cluster set containing a small amount of cluster information is not a core cluster set required by the incremental clustering algorithm, processing the cluster sets only according to the amount of cluster information they contain, that is, directly cleaning the cluster sets with a small amount of cluster information, would lower the accuracy of the incremental clustering result. In that case, cluster sets that are still being updated continuously would be ignored, and cluster sets that were formed only recently and therefore contain a small amount of cluster information could not participate in the incremental clustering calculation. A recently formed cluster set can be regarded as a new cluster set, and a new cluster set matters more to the incremental clustering calculation than an old cluster set formed long ago; if the new cluster sets cannot participate in the incremental clustering calculation, the accuracy of the incremental clustering result is necessarily greatly reduced.
In view of this, a time point may be preset in this embodiment; it may be set according to the actual situation and is not specifically limited herein. After the time point is set, the first cluster set whose latest update time is after the preset time point can be determined according to the latest update time of each cluster set.
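As a minimal sketch of step S220, the following Python fragment derives the latest update time of each cluster set from its update history and separates the first cluster set from the other cluster sets. The helper names, the calendar date and the 10:30 cutoff are hypothetical; only the 9:00, 10:00 and 11:00 update times come from the example of cluster sets A and B above.

```python
from datetime import datetime

def latest_update_time(update_times):
    """The latest update time is the most recent of a set's update times."""
    return max(update_times)

def split_by_recency(update_log, cutoff):
    """Return (first, others): cluster sets updated after the cutoff, and the rest."""
    first, others = [], []
    for cluster_id, times in update_log.items():
        if latest_update_time(times) > cutoff:
            first.append(cluster_id)
        else:
            others.append(cluster_id)
    return first, others

day = (2021, 3, 10)  # hypothetical date, chosen only to build timestamps
update_log = {
    "A": [datetime(*day, 9, 0), datetime(*day, 10, 0)],  # updated twice
    "B": [datetime(*day, 11, 0)],                        # updated once
}
first, others = split_by_recency(update_log, cutoff=datetime(*day, 10, 30))
```

With the assumed 10:30 cutoff, cluster set B falls into the first cluster set, and cluster set A is left among the other cluster sets to be screened by content.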
In step S230, a cluster set with a cluster information content greater than or equal to a set threshold is selected according to the cluster information content contained in other cluster sets except the first cluster set in the plurality of cluster sets, so as to obtain a second cluster set.
As described above, the first cluster set consists of the cluster sets, among the plurality of cluster sets, whose latest update time is after the predetermined time point. The other cluster sets, that is, those other than the first cluster set, can therefore be understood as the cluster sets whose latest update time is before the predetermined time point; in other words, the other cluster sets have not updated the cluster information in the set during the period from the predetermined time point to the current time point.
Because the other cluster sets have not updated the cluster information in the set for a period of time, they can be further screened according to the content of the cluster information they contain to obtain a screened second cluster set. The second cluster set consists of those of the other cluster sets whose cluster information content is greater than or equal to a set threshold; that is, the cluster sets whose cluster information content is smaller than the set threshold are filtered out of the other cluster sets. The set threshold may be set according to the actual situation and is not specifically limited herein.
In step S240, processing results for a plurality of cluster sets are generated from the first cluster set and the second cluster set.
In this embodiment, after the first cluster set whose latest update time is after the predetermined time point has been determined from the plurality of cluster sets according to the latest update time of each cluster set, and the second cluster set has been screened out of the other cluster sets according to the content of the cluster information they contain, a processing result for the plurality of cluster sets may be generated directly according to the first cluster set and the second cluster set.
In an embodiment, after the first cluster set and the second cluster set are obtained, they can be used as the processing result of the plurality of cluster sets to be processed. That is, the processing result includes only the first cluster set and the second cluster set, and the cluster sets other than the first cluster set and the second cluster set can be cleaned directly. This reduces the number of cluster sets included in the processing result and avoids performance degradation of the subsequent incremental clustering algorithm.
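Steps S210 to S240 can be put together in a short Python sketch. The `ClusterSet` type, its field names, and the use of the member count as the "content of cluster information" are assumptions for illustration; the embodiments below also allow an information content score to serve as the screening basis.

```python
from dataclasses import dataclass

@dataclass
class ClusterSet:
    cluster_id: str
    members: list          # cluster information contained in the set
    latest_update: float   # latest update time, e.g. a Unix timestamp

def process_cluster_sets(cluster_sets, cutoff, min_size):
    """Keep recently updated sets plus older sets that are large enough."""
    # S220: first cluster set, latest update time after the preset time point.
    first = [c for c in cluster_sets if c.latest_update > cutoff]
    # S230: among the other sets, keep those whose content meets the threshold.
    others = [c for c in cluster_sets if c.latest_update <= cutoff]
    second = [c for c in others if len(c.members) >= min_size]
    # S240: the processing result contains only the first and second sets.
    return first + second

sets = [
    ClusterSet("new-small", ["u1"], latest_update=200.0),
    ClusterSet("old-large", ["u2", "u3", "u4"], latest_update=50.0),
    ClusterSet("old-small", ["u5"], latest_update=40.0),
]
result = process_cluster_sets(sets, cutoff=100.0, min_size=3)
kept = [c.cluster_id for c in result]
```

Here the small but recently updated set is retained, the old set with enough members is retained, and only the old, small set is cleaned.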
In one embodiment of the present application, because the plurality of cluster sets to be processed have already been processed and a processing result has been obtained, if newly added information to be clustered appears subsequently, a cluster set matching the information to be clustered may be obtained directly from the processing result, and the information to be clustered is then added to the matching cluster set.
In this embodiment, the processing result contains significantly fewer cluster sets than the original plurality of cluster sets, so the amount of calculation required to cluster the newly added information is reduced and the efficiency of incremental clustering is improved.
Based on the technical scheme of the above embodiment, a first cluster set with the latest update time after a predetermined time point is determined from a plurality of cluster sets to be processed, a second cluster set is screened according to the content of cluster information contained in other cluster sets except the first cluster set in the plurality of cluster sets, and further, a processing result for the plurality of cluster sets is generated according to the first cluster set and the second cluster set. According to the embodiment, the plurality of cluster sets are processed by combining the latest updating time and the included cluster information content, the accuracy of a processing result is higher, the unimportant cluster sets are cleaned, the important cluster sets are reserved, and the problem that the algorithm performance is reduced due to the fact that the cluster sets are increased in an incremental clustering algorithm is effectively solved.
In one embodiment of the present application, the content of the cluster information may be represented by an information content score, which is calculated according to the content of the cluster information; the higher the information content score, the greater the content of the cluster information. In this embodiment, screening out, according to the content of the cluster information contained in the other cluster sets, the cluster sets whose cluster information content is greater than or equal to the set threshold amounts to screening out the cluster sets whose information content score is greater than or equal to a preset score threshold. As shown in fig. 3, step S230 may specifically include:
step S310, calculating information content scores corresponding to other cluster sets according to the cluster information content contained in other cluster sets except the first cluster set in the plurality of cluster sets.
In this embodiment, the other cluster sets may be screened according to the content of the cluster information they contain by first calculating the information content scores corresponding to the other cluster sets and then screening the other cluster sets according to those scores.
It should be noted that the information content score is a score that can represent the content of the cluster information contained in other cluster sets, and the higher the information content score, the more the cluster information content is represented.
In one embodiment of the present application, as shown in fig. 4, the manner of calculating the information content scores corresponding to other cluster sets may specifically include steps S410 to S420, which are described in detail below:
step S410, obtaining the content of the specific type of cluster information contained in the other cluster sets, and the ratio between the content of the specific type of cluster information contained in the other cluster sets and the content of the contained cluster information.
The specific type of cluster information refers to abnormal cluster information among the cluster information. For example, in a user set formed by a plurality of users, the specific type of cluster information may be malicious users, that is, users who publish malicious information such as fraud information, malicious advertisement information, pornographic information, and the like. In a website set formed by a plurality of websites, the specific type of cluster information may be malicious websites, that is, illegal websites that intentionally execute malicious tasks on a computer system, such as viruses, worms, and Trojan horses; a malicious website lets people browse page content normally in the form of a web page while illegally acquiring various data on the user's computer.
Considering that an incremental clustering algorithm can be used to discover malicious cluster sets according to the degree of maliciousness of a cluster set, and that the processing result of the plurality of cluster sets participates in the subsequent incremental clustering algorithm, the malicious cluster information may be taken into account in the information content score that serves as the screening basis when the plurality of cluster sets are processed, especially when the other cluster sets are screened according to the information content score.
In particular, the specific type of cluster information may be characterized by the content of the specific type of cluster information and by the ratio between that content and the content of the cluster information contained. Thus, in order to calculate the information content scores corresponding to the other cluster sets, the content of the specific type of cluster information contained in the other cluster sets, and the ratio between that content and the content of the cluster information contained, may be obtained in advance.
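The two quantities obtained in step S410 can be sketched as follows, assuming that "content" is measured as a count of members and that each member carries a hypothetical `malicious` flag; neither assumption is stated in the source.

```python
def specific_type_stats(members, is_specific_type):
    """Return (total content, specific-type content, ratio) for one cluster set."""
    total = len(members)
    specific = sum(1 for member in members if is_specific_type(member))
    # Guard against an empty set so the ratio is always defined.
    ratio = specific / total if total else 0.0
    return total, specific, ratio

# Hypothetical user set in which two of four users are flagged as malicious.
users = [
    {"id": 1, "malicious": True},
    {"id": 2, "malicious": False},
    {"id": 3, "malicious": False},
    {"id": 4, "malicious": True},
]
total, specific, ratio = specific_type_stats(users, lambda u: u["malicious"])
```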
Step S420, calculating the information content scores corresponding to the other cluster sets according to the cluster information contents contained in the other cluster sets, the specific type of cluster information contents contained in the other cluster sets and the ratio between the specific type of cluster information contents contained in the other cluster sets and the contained cluster information contents.
In this embodiment, after the specific type of the cluster information content included in the other cluster set and the ratio between the specific type of the cluster information content included in the other cluster set and the included cluster information content are obtained, the information content score corresponding to the other cluster set may specifically be calculated according to the cluster information content included in the other cluster set, the specific type of the cluster information content included in the other cluster set, and the ratio between the specific type of the cluster information content included in the other cluster set and the included cluster information content.
In one embodiment of the present application, as shown in fig. 5, step S420 may specifically include steps S510-S520, which are described as follows:
Step S510, determining the target calculated values corresponding to the other cluster sets according to the content of the cluster information contained in the other cluster sets.
In this embodiment, considering the data bias that the content of the cluster information in individual cluster sets may cause, a clustering operation may be performed on the other cluster sets. Specifically, the clustering operation determines a target calculated value for each of the other cluster sets and uses the target calculated value in the calculation of the corresponding information content score. The target calculated value corresponding to another cluster set may be a value determined according to the content of the cluster information contained in that cluster set; for example, the ratio between the content of the cluster information contained in the other cluster set and a preset numerical value may be used as the target calculated value.
In one embodiment of the present application, according to the content of the cluster information contained in the other cluster set, the manner of determining the target calculated value corresponding to the other cluster set may be: firstly, according to the content of cluster information contained in other cluster sets, determining the numerical value intervals corresponding to the other cluster sets, and then, according to the numerical value intervals corresponding to the other cluster sets and the corresponding relation between the numerical value intervals and the calculated values, determining the target calculated values corresponding to the other cluster sets.
In this embodiment, it is understood that the correspondence between the numerical intervals and the calculated values has been established in advance, and the established correspondence may be as shown in the following table 1:
Numerical value interval | Calculated value |
1-100 | A |
100-2000 | B |
2000-5000 | C |
…… | …… |
100000-1000000 | Z |
TABLE 1
For example, assuming that the numerical interval corresponding to the other cluster set 1 is determined to be 100-2000 according to the content of the cluster information contained in the other cluster set 1, the target calculated value of the other cluster set 1 may be determined to be B by looking up Table 1.
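The interval lookup can be sketched in Python with a binary search over interval upper bounds. Only the first three rows of Table 1 are modeled; the numeric values standing in for A, B and C are hypothetical, and because adjacent intervals in Table 1 share a boundary, the sketch assumes that each upper bound belongs to the lower interval.

```python
from bisect import bisect_left

# Upper bounds of the intervals 1-100, 100-2000 and 2000-5000 from Table 1.
UPPER_BOUNDS = [100, 2000, 5000]
# Hypothetical numeric stand-ins for the calculated values A, B and C.
CALCULATED_VALUES = [1.0, 2.0, 3.0]

def target_calculated_value(content):
    """Map a cluster information content to its target calculated value."""
    if content < 1:
        raise ValueError("content must be at least 1")
    # Index of the first interval whose upper bound is >= content.
    index = bisect_left(UPPER_BOUNDS, content)
    if index == len(UPPER_BOUNDS):
        raise ValueError("content lies beyond the intervals modeled here")
    return CALCULATED_VALUES[index]

value = target_calculated_value(1500)  # falls in 100-2000, the stand-in for B
```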
Step S520, calculating the information content scores corresponding to the other cluster sets according to the target calculated values corresponding to the other cluster sets, the specific type of cluster information content contained in the other cluster sets and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content.
After determining the target calculated values corresponding to the other cluster sets, the information content scores corresponding to the other cluster sets can be calculated according to the target calculated values corresponding to the other cluster sets, the specific type of cluster information content contained in the other cluster sets and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content.
In one embodiment of the present application, as shown in fig. 6, step S520 may specifically include step S610 to step S620, which will be specifically described below:
Step S610, performing a weighted summation of the target calculated values corresponding to the other cluster sets, the content of the specific type of cluster information contained in the other cluster sets, and the ratio between the content of the specific type of cluster information contained in the other cluster sets and the content of the cluster information contained, to obtain an operation result.
Specifically, in order to calculate the information content scores corresponding to the other cluster sets, the three quantities above may be weighted and summed to obtain an operation result. That is, the target calculated value is weighted by the first weight value corresponding to the target calculated value to obtain a first operation value; the content of the specific type of cluster information is weighted by the second weight value corresponding to that content to obtain a second operation value; the ratio between the content of the specific type of cluster information and the content of the cluster information contained is weighted by the third weight value corresponding to the ratio to obtain a third operation value; finally, the sum of the first operation value, the second operation value and the third operation value is calculated to obtain the operation result.
Step S620, determining the information content scores corresponding to the other cluster sets according to the operation result.
In this embodiment, after the weighting operation obtains the operation result, the information content scores corresponding to other cluster sets may be determined according to the operation result. For example, the calculation result obtained by the weighting calculation may be directly used as the information content score corresponding to the other cluster set.
In another embodiment, according to the operation result, determining the information content score corresponding to the other cluster set may further include calculating a sum of the operation result and a preset constant value, and then normalizing the calculated sum to obtain a normalized result value, where the normalized result value is used as the information content score corresponding to the other cluster set.
Specifically, in this embodiment, the information content score $S_i$ corresponding to the $i$-th of the other cluster sets may be calculated according to the following formula one and formula two:

$$S_i = \mathrm{sigmoid}\left(\lambda_0 + \lambda_1 c_i + \lambda_2 a_i + \lambda_3 r_i\right)$$

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

where $\mathrm{sigmoid}(x)$ is the normalization function, $\lambda_0$ is the preset constant value, $c_i$ is the target calculated value corresponding to the other cluster set, $\lambda_1$ is the weight value corresponding to the target calculated value, $a_i$ is the specific type of cluster information content contained in the other cluster set, $\lambda_2$ is the weight value corresponding to the specific type of cluster information content, $r_i$ is the ratio between the specific type of cluster information content contained in the other cluster set and the contained cluster information content, and $\lambda_3$ is the weight value corresponding to the ratio.
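The weighted summation, the addition of a preset constant and the sigmoid normalization described above can be illustrated with a minimal Python sketch. The function names, argument names and the example weight values below are assumptions made for illustration only; the application does not disclose concrete weights.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used here as the normalization step."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # equivalent form that avoids overflow for very negative x
    return z / (1.0 + z)


def information_content_score(target_value: float,
                              specific_count: float,
                              specific_ratio: float,
                              lam0: float = 0.0,
                              lam1: float = 1.0,
                              lam2: float = 1.0,
                              lam3: float = 1.0) -> float:
    """Score one cluster set.

    target_value   -- target calculated value of the cluster set
    specific_count -- specific type of cluster information content
    specific_ratio -- specific content divided by total content
    lam0..lam3     -- preset constant and the three weights (hypothetical defaults)
    """
    # Step S610: weighted summation of the three quantities.
    operation_result = (lam1 * target_value
                        + lam2 * specific_count
                        + lam3 * specific_ratio)
    # Add the preset constant, then normalize.
    return sigmoid(lam0 + operation_result)


# Hypothetical weights, chosen only to show the call.
example_score = information_content_score(
    target_value=0.7, specific_count=12, specific_ratio=0.3,
    lam0=-2.0, lam1=1.0, lam2=0.05, lam3=2.0)
```

Because the sigmoid is monotonically increasing, a larger weighted sum always yields a larger score, so the later threshold comparison preserves the ordering of the operation results.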
With continued reference to fig. 3, in step S320, cluster sets whose information content scores are greater than or equal to a preset score threshold are screened out from the other cluster sets, so as to obtain a screened second cluster set.
The information content score represents the cluster information content contained in the other cluster sets: the higher the information content score, the greater the cluster information content. Therefore, after the information content scores corresponding to the other cluster sets are calculated in step S310, the cluster sets whose information content scores are greater than or equal to the preset score threshold may be screened out from the other cluster sets to obtain the screened second cluster set. In other words, the cluster sets whose information content scores are smaller than the preset score threshold are filtered out of the other cluster sets, and the remaining cluster sets form the screened second cluster set. The preset score threshold may be set according to the specific situation, and the embodiment of the present application is not specifically limited herein.
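The screening in step S320 amounts to a threshold filter. The following Python sketch shows one possible form; the identifiers, the scores and the threshold value are hypothetical and are not taken from the application.

```python
from typing import Dict, List


def screen_cluster_sets(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the ids of cluster sets whose score is >= threshold."""
    return [cluster_id for cluster_id, score in scores.items()
            if score >= threshold]


# Hypothetical scores and threshold, for illustration only.
hypothetical_scores = {"a": 0.2, "b": 0.5, "c": 0.9}
second_cluster_set = screen_cluster_sets(hypothetical_scores, threshold=0.5)
```

Note that the comparison is inclusive, matching the "greater than or equal to" wording of the embodiment.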
In one embodiment of the present application, each cluster set has a corresponding identification number (Identity document, ID) that uniquely identifies the cluster set, and the identification number of each cluster set may be reserved while the plurality of cluster sets are processed. The purpose of reserving the identification numbers is that the identification number of a subsequently added cluster set will not repeat an existing one. In this embodiment, as shown in fig. 7, step S240 may specifically include steps S710 to S720, which are described in detail below:
Step S710, obtaining the identification number of each cluster set.
As the number of cluster sets increases, how to effectively manage these cluster sets is important. In this embodiment, the corresponding relationship is generated by mapping the cluster set with the identification number, and the identification number is used for uniquely identifying the cluster set, so that not only is the query on the cluster set facilitated, but also the effective management on the cluster set is realized.
Therefore, in the case where each cluster set has a corresponding identification number, the identification number of each cluster set can be acquired in the process of processing the plurality of cluster sets.
Step S720, the identification numbers of the cluster sets, the first cluster set and the second cluster set are combined into a processing result for a plurality of cluster sets.
Because the first cluster set is a set with the latest update time after a preset time point, and the second cluster set is a set screened according to the content of cluster information contained in other cluster sets, the first cluster set and the second cluster set can be considered as important cluster sets in a plurality of cluster sets to be processed, and the first cluster set and the second cluster set can be directly reserved.
Specifically, in this embodiment, the processing results for the plurality of cluster sets may include the identification number of each cluster set, the first cluster set, and the second cluster set.
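One way to picture such a processing result is as a record that keeps every identification number but only the contents of the retained cluster sets. The sketch below is an assumption about the data layout (a dictionary keyed by integer identification number); the application does not prescribe a concrete structure.

```python
from typing import Dict, Iterable, List


def build_processing_result(cluster_sets: Dict[int, List[str]],
                            first_ids: Iterable[int],
                            second_ids: Iterable[int]) -> dict:
    """Combine all identification numbers with the retained cluster sets."""
    first = set(first_ids)
    second = set(second_ids)
    return {
        # Every identification number is reserved, including dropped sets.
        "reserved_ids": sorted(cluster_sets),
        "first_cluster_set": {i: cluster_sets[i] for i in cluster_sets if i in first},
        "second_cluster_set": {i: cluster_sets[i] for i in cluster_sets if i in second},
    }


# Hypothetical data: set 2 is neither recently updated nor high-scoring.
demo_sets = {1: ["m1", "m2"], 2: ["m3"], 3: ["m4", "m5", "m6"]}
demo_result = build_processing_result(demo_sets, first_ids=[3], second_ids=[1])
```

In this sketch the contents of cluster set 2 are dropped from the result, but its identification number stays in `reserved_ids`.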
In an embodiment of the present application, based on the above embodiment, if there is a new cluster set, the identification number of the cluster set may be obtained according to the identification number of each cluster set. Specifically, first, the identification numbers of each cluster set may be obtained from the processing result of the cluster set to be processed, a plurality of identification numbers may be obtained, and further, the identification number of the newly added cluster set may be generated according to the obtained plurality of identification numbers.
In an embodiment, the identification numbers may be incremented according to a time sequence of forming the cluster set, so that the identification number of the newly added cluster set may be incremented based on a maximum identification number of the plurality of identification numbers, specifically, a sum of the maximum identification number and a preset value may be calculated, and the calculated sum value is used as the identification number of the newly added cluster set.
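The incremental generation of an identification number can be sketched as follows. Integer identification numbers and a default preset value of 1 are assumptions made for the example.

```python
from typing import Iterable


def next_identification_number(reserved_ids: Iterable[int], preset_value: int = 1) -> int:
    """Return max(reserved_ids) + preset_value as the id of a new cluster set."""
    ids = list(reserved_ids)
    if not ids:
        raise ValueError("at least one reserved identification number is required")
    return max(ids) + preset_value


# Ids 19, 9 and 20 are reserved; the new cluster set receives 21.
new_id = next_identification_number([1, 2, 3, 4, 5, 19, 9, 20])
```

Because the maximum is taken over all reserved identification numbers, including those of cluster sets that were filtered out, a new cluster set does not reuse the identification number of a removed one.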
In one embodiment of the present application, according to the content of the cluster information contained in the other cluster sets except the first cluster set in the plurality of cluster sets, the other cluster sets are screened to obtain a screened second cluster set, and the screened second cluster set can be used as a processing result for the plurality of cluster sets, that is, the screened second cluster set can participate in subsequent incremental cluster calculation, and for the cluster sets except the second cluster set in the other cluster sets, in this embodiment, the cluster sets except the second cluster set can be stored in an offline manner. The significance of this is that subsequent backtracking can be facilitated.
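A minimal sketch of separating the retained cluster sets from those stored offline is given below. Serializing to JSON is an assumption chosen for illustration; the application only states that the remaining cluster sets are stored in an offline manner and does not name a storage format.

```python
import json
from typing import Dict, Iterable, List, Tuple


def split_for_offline_storage(
        other_sets: Dict[str, List[str]],
        second_ids: Iterable[str]) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split the other cluster sets into (retained, to_archive)."""
    keep = set(second_ids)
    retained = {cid: info for cid, info in other_sets.items() if cid in keep}
    to_archive = {cid: info for cid, info in other_sets.items() if cid not in keep}
    return retained, to_archive


def serialize_archive(to_archive: Dict[str, List[str]]) -> str:
    """Serialize the archived cluster sets so they can be restored later."""
    return json.dumps(to_archive, sort_keys=True)


# Hypothetical cluster sets keyed by identification number.
others = {"id_1": ["m1"], "id_2": ["m2", "m3"], "id_4": ["m4"]}
retained, archived = split_for_offline_storage(others, second_ids=["id_4"])
payload = serialize_archive(archived)
restored = json.loads(payload)  # backtracking: recover the archived sets
```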
FIG. 8 shows a logic diagram of a method for processing a cluster set according to one embodiment of the application, as shown in FIG. 8, the method for processing a cluster set may specifically include the following steps:
S0, acquiring a plurality of cluster sets to be processed.
As shown in fig. 8A, a plurality of cluster sets are acquired, and the identification numbers of the cluster sets are id_1, id_2, id_3, id_4, id_5 … … id_19, id_9, id_20, respectively.
S1, determining a first cluster set from a plurality of cluster sets according to the latest update time of each cluster set, wherein the first cluster set is a cluster set with the latest update time after a preset time point.
Referring to FIG. 8B, a first cluster set, including a cluster set identified as id_19, a cluster set identified as id_9, and a cluster set identified as id_20, determined from the latest update times of the respective cluster sets in FIG. 8A is shown in FIG. 8B.
S2, calculating the information content scores corresponding to other cluster sets for other cluster sets except the first cluster set in the plurality of cluster sets.
In this step, the information content scores corresponding to other cluster sets may be calculated. For a specific calculation method, reference may be made to the above-described embodiments of the present application for calculating the information content score.
Referring also to fig. 8C and fig. 8D, fig. 8C is a schematic diagram of the other cluster sets, and fig. 8D is a schematic diagram of the information content scores corresponding to the other cluster sets. As shown in fig. 8D, the calculated information content score of the cluster set with identification number id_1 is 0.37, that of the cluster set with identification number id_2 is 0.29, that of the cluster set with identification number id_3 is 0.33, that of the cluster set with identification number id_4 is 0.99, and that of the cluster set with identification number id_5 is 0.43.
S3, screening the other cluster sets according to the information content scores corresponding to the other cluster sets to obtain a screened second cluster set. Specifically, the cluster sets whose information content scores are greater than or equal to a preset score threshold may be screened out from the other cluster sets, so as to obtain the screened second cluster set. The screened second cluster set is shown in fig. 8E.
S4, generating processing results for a plurality of cluster sets according to the first cluster set and the second cluster set. See in particular fig. 8F.
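Steps S0 to S4 can be traced end to end with the following Python sketch. The identification numbers and the information content scores are the ones given for fig. 8; the update times, the predetermined time point and the score threshold of 0.4 are hypothetical, so the resulting second cluster set is illustrative and is not claimed to reproduce fig. 8E.

```python
from datetime import datetime
from typing import Dict


def process_cluster_sets(update_times: Dict[str, datetime],
                         scores: Dict[str, float],
                         time_point: datetime,
                         score_threshold: float) -> dict:
    """Run S1 to S4 on cluster sets identified by their ids."""
    # S1: the first cluster set holds the sets updated after the time point.
    first = [cid for cid, t in update_times.items() if t > time_point]
    # S2: the remaining sets are scored (scores are supplied precomputed here).
    others = [cid for cid in update_times if cid not in first]
    # S3: keep the other sets whose score reaches the threshold.
    second = [cid for cid in others if scores[cid] >= score_threshold]
    # S4: the processing result combines both groups and all reserved ids.
    return {"reserved_ids": list(update_times), "first": first, "second": second}


old = datetime(2020, 1, 1)     # hypothetical update time of stale sets
recent = datetime(2021, 3, 1)  # hypothetical update time of fresh sets
update_times = {"id_1": old, "id_2": old, "id_3": old, "id_4": old, "id_5": old,
                "id_19": recent, "id_9": recent, "id_20": recent}
# Scores as listed for fig. 8D.
scores = {"id_1": 0.37, "id_2": 0.29, "id_3": 0.33, "id_4": 0.99, "id_5": 0.43}

result = process_cluster_sets(update_times, scores,
                              time_point=datetime(2021, 1, 1),
                              score_threshold=0.4)
```

With these assumed inputs, id_19, id_9 and id_20 form the first cluster set, and id_4 and id_5 pass the hypothetical threshold.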
Based on the cluster set processing method provided by the embodiment of the present application, an experiment was carried out using merchant registration data for 5 months. For convenience of description, refer to table 2, which compares the speed of incremental clustering with and without the method provided by the present application.
Newly added 5-month information to be clustered | Without the method of the present application | With the method of the present application |
Number of existing cluster sets | 50W | 7W |
Time consumed to cluster every 1W pieces of newly added information to be clustered | 73 minutes | 9.6 minutes |
TABLE 2
As can be seen from table 2, when the cluster set processing method of the present application is not applied, the number of existing cluster sets reaches 50W, and for the newly added 5-month information to be clustered, clustering every 1W pieces of newly added information takes 73 minutes. After the cluster set processing method is applied, the number of cluster sets is reduced to 7W, and clustering every 1W pieces of newly added information takes only 9.6 minutes, that is, the time consumed is reduced by a factor of about 7.6 (73 / 9.6 ≈ 7.6).
Furthermore, because the method only needs to perform simple calculations on the cluster sets, one round of cluster set processing can be kept within 2 seconds, which is negligible for hour-level services such as the malicious merchant group mining service.
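The figures in table 2 can be checked with a few lines of arithmetic. The sketch assumes that "W" abbreviates ten thousand, as is common in Chinese-language sources; that reading is an assumption and is not stated in the application.

```python
W = 10_000  # assumed meaning of the "W" suffix

sets_without = 50 * W       # existing cluster sets without the method
sets_with = 7 * W           # existing cluster sets with the method
minutes_without = 73.0      # time per 1W new items without the method
minutes_with = 9.6          # time per 1W new items with the method

speedup = minutes_without / minutes_with   # about 7.6
set_reduction = sets_without / sets_with   # about 7.1
time_saved_fraction = 1.0 - minutes_with / minutes_without

print(f"speedup: {speedup:.1f}x")
print(f"cluster sets reduced by a factor of {set_reduction:.1f}")
print(f"time saved: {time_saved_fraction:.1%}")
```

The ratio of cluster-set counts and the ratio of running times are of similar size in this data point, which is what one would expect if the matching cost grows roughly in proportion to the number of existing cluster sets.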
The following describes an embodiment of the apparatus of the present application, which may be used to execute the method for processing a cluster set in the foregoing embodiment of the present application. For details not disclosed in the embodiment of the apparatus of the present application, please refer to an embodiment of the method for processing a cluster set according to the present application.
FIG. 9 shows a block diagram of a processing apparatus for clustering sets according to one embodiment of the application.
Referring to fig. 9, a cluster set processing apparatus 900 according to an embodiment of the present application includes: an acquisition unit 902, a determination unit 904, a screening unit 906, and a generation unit 908.
The acquiring unit 902 is configured to acquire a plurality of to-be-processed cluster sets, where each cluster set includes a plurality of cluster information; the determining unit 904 is configured to determine, from the plurality of cluster sets, a first cluster set whose latest update time is after a predetermined point in time according to the latest update time of the respective cluster sets; the screening unit 906 is configured to screen a cluster set with the content of the cluster information greater than or equal to a set threshold according to the content of the cluster information contained in other cluster sets except the first cluster set in the plurality of cluster sets, so as to obtain a screened second cluster set; the generating unit 908 is configured to generate processing results for the plurality of cluster sets according to the first cluster set and the second cluster set.
In some embodiments of the application, the screening unit 906 includes: a calculating subunit configured to calculate information content scores corresponding to other cluster sets except the first cluster set according to cluster information contents contained in the other cluster sets in the plurality of cluster sets; and a screening subunit configured to screen out, from the other cluster sets, the cluster sets whose information content scores are greater than or equal to a preset score threshold, to obtain the screened second cluster set.
In some embodiments of the application, the computing subunit is configured to: acquiring the content of the specific type of clustering information contained in the other clustering sets and the ratio between the content of the specific type of clustering information contained in the other clustering sets and the content of the contained clustering information; and calculating the information content scores corresponding to the other cluster sets according to the cluster information content contained in the other cluster sets, the specific type of cluster information content contained in the other cluster sets and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content.
In some embodiments of the application, the computing subunit is configured to: determining target calculated values corresponding to the other cluster sets according to the cluster information content contained in the other cluster sets; and calculating the information content scores corresponding to the other cluster sets according to the target calculated values corresponding to the other cluster sets, the specific type of cluster information content contained in the other cluster sets and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content.
In some embodiments of the application, the computing subunit is configured to: determining a numerical interval corresponding to the other cluster sets according to the cluster information content contained in the other cluster sets; and determining target calculated values corresponding to the other cluster sets according to the numerical value intervals corresponding to the other cluster sets and the corresponding relation between the numerical value intervals and the calculated values.
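The mapping from cluster information content to a numerical interval, and from that interval to a target calculated value, can be sketched with a sorted boundary list and a binary search. The interval boundaries and the calculated values below are hypothetical placeholders; the application does not fix them in this passage.

```python
import bisect

# Hypothetical interval boundaries: [0, 10), [10, 100), [100, 1000), [1000, inf)
INTERVAL_BOUNDARIES = [10, 100, 1000]
# Hypothetical calculated value assigned to each of the four intervals.
CALCULATED_VALUES = [0.1, 0.4, 0.7, 1.0]


def target_calculated_value(information_content: int) -> float:
    """Look up the calculated value for the interval containing the content."""
    if information_content < 0:
        raise ValueError("information content must be non-negative")
    # bisect_right returns the index of the interval the value falls into.
    interval_index = bisect.bisect_right(INTERVAL_BOUNDARIES, information_content)
    return CALCULATED_VALUES[interval_index]
```

Bucketing in this way bounds the contribution of very large cluster sets, since every content value at or above the last boundary maps to the same calculated value.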
In some embodiments of the application, the computing subunit is configured to: carrying out weighted summation on target calculated values corresponding to the other cluster sets, the content of the specific type of cluster information contained in the other cluster sets and the ratio between the content of the specific type of cluster information contained in the other cluster sets and the content of the cluster information contained in the other cluster sets to obtain an operation result; and determining the information content scores corresponding to the other clustering sets according to the operation result.
In some embodiments of the application, the computing subunit is configured to: calculating the sum of the operation result and a preset constant value; and carrying out normalization processing on the calculated sum value to obtain a normalized result value, and taking the normalized result value as the information content score corresponding to the other clustering set.
In some embodiments of the application, the apparatus further comprises: the set acquisition unit is configured to acquire a clustering set matched with the information to be clustered from the processing results of the plurality of clustering sets to be processed if the information to be clustered newly added exists; and the adding unit is configured to add the information to be clustered into the matched cluster set.
In some embodiments of the present application, the generating unit 908 includes: an acquisition subunit configured to acquire identification numbers of the respective cluster sets; and the processing subunit is configured to take the identification numbers of the cluster sets, the first cluster set and the second cluster set as processing results for the plurality of cluster sets.
In some embodiments of the application, the apparatus further comprises: an identification number obtaining unit configured to obtain identification numbers of each cluster set from processing results of the plurality of cluster sets to be processed if a newly added cluster set exists, so as to obtain a plurality of identification numbers; an identification number generating unit configured to generate an identification number of the newly added cluster set according to the plurality of identification numbers.
In some embodiments of the application, the identification number generation unit is configured to: acquiring a maximum identification number from the plurality of identification numbers; and calculating the sum of the maximum identification number and a preset value, and taking the calculated sum value as the identification number of the newly added cluster set.
In some embodiments of the application, the apparatus further comprises: and a storage unit configured to store the cluster sets other than the second cluster set in the other cluster sets in an offline manner.
Fig. 10 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a central processing unit (Central Processing Unit, CPU) 1001 that can perform various appropriate actions and processes, such as performing the method described in the above embodiment, according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access Memory (Random Access Memory, RAM) 1003. In the RAM 1003, various programs and data required for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An Input/Output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1010 as needed, so that a computer program read out therefrom is installed into the storage section 1008 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When executed by the Central Processing Unit (CPU) 1001, the computer program performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (15)
1. A method of processing a collection of clusters, the method comprising:
acquiring a plurality of cluster sets to be processed, wherein each cluster set contains a plurality of cluster information;
determining a first cluster set with the latest updating time after a preset time point from the plurality of cluster sets according to the latest updating time of each cluster set;
screening out a cluster set with the content of the cluster information greater than or equal to a set threshold according to the content of the cluster information contained in other cluster sets except the first cluster set in the plurality of cluster sets, and obtaining a screened second cluster set;
generating processing results for the plurality of cluster sets according to the first cluster set and the second cluster set;
wherein, according to the content of the cluster information contained in the other cluster sets except the first cluster set in the plurality of cluster sets, the cluster set with the content of the cluster information larger than or equal to the set threshold value is screened out, and a screened second cluster set is obtained, which comprises:
calculating information content scores corresponding to other cluster sets according to the cluster information content contained in the other cluster sets except the first cluster set in the plurality of cluster sets;
and screening out, from the other cluster sets, the cluster sets whose information content scores are greater than or equal to a preset score threshold, to obtain the screened second cluster set.
2. The method of claim 1, wherein calculating the information content scores corresponding to other cluster sets, among the plurality of cluster sets, from the cluster information content contained in the other cluster sets except the first cluster set, comprises:
acquiring the content of the specific type of clustering information contained in the other clustering sets and the ratio between the content of the specific type of clustering information contained in the other clustering sets and the content of the contained clustering information; the specific type of clustering information refers to abnormal clustering information in the clustering information;
and calculating the information content scores corresponding to the other cluster sets according to the cluster information content contained in the other cluster sets, the specific type of cluster information content contained in the other cluster sets and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content.
3. The method of claim 2, wherein calculating the information content scores corresponding to the other cluster sets based on the cluster information content contained in the other cluster sets, the specific type of cluster information content contained in the other cluster sets, and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content, comprises:
determining target calculated values corresponding to the other cluster sets according to the cluster information content contained in the other cluster sets;
and calculating the information content scores corresponding to the other cluster sets according to the target calculated values corresponding to the other cluster sets, the specific type of cluster information content contained in the other cluster sets and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content.
4. A method according to claim 3, wherein determining the target calculated value corresponding to the other cluster set according to the content of the cluster information contained in the other cluster set comprises:
determining a numerical interval corresponding to the other cluster sets according to the cluster information content contained in the other cluster sets;
and determining target calculated values corresponding to the other cluster sets according to the numerical value intervals corresponding to the other cluster sets and the corresponding relation between the numerical value intervals and the calculated values.
5. A method according to claim 3, wherein calculating the information content score corresponding to the other cluster set based on the target calculated value corresponding to the other cluster set, the specific type of cluster information content contained in the other cluster set, and the ratio between the specific type of cluster information content contained in the other cluster set and the contained cluster information content, comprises:
carrying out weighted summation on target calculated values corresponding to the other cluster sets, the content of the specific type of cluster information contained in the other cluster sets and the ratio between the content of the specific type of cluster information contained in the other cluster sets and the content of the cluster information contained in the other cluster sets to obtain an operation result;
and determining the information content scores corresponding to the other cluster sets according to the operation result.
6. The method of claim 5, wherein determining the information content scores corresponding to the other cluster sets based on the operation result comprises:
calculating the sum of the operation result and a preset constant value;
and normalizing the calculated sum to obtain a normalized result value, and taking the normalized result value as the information content score corresponding to the other cluster sets.
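The scoring steps of claims 5 and 6 (weighted summation, addition of a preset constant, normalization) can be sketched as below. The weights, the constant and the use of a sigmoid as the normalization function are all assumptions made for illustration; the claims do not specify any of them.

```python
import math

# Hypothetical weights for (target calculated value, specific-type count, ratio).
WEIGHTS = (0.5, 0.01, 2.0)

# Hypothetical preset constant added to the weighted sum.
PRESET_CONSTANT = -1.0


def information_content_score(target_value: float,
                              specific_count: int,
                              total_count: int) -> float:
    """Return a score in (0, 1) for one cluster set."""
    # Ratio of specific-type (abnormal) information to all information in the set.
    ratio = specific_count / total_count if total_count else 0.0
    w_value, w_count, w_ratio = WEIGHTS
    # Weighted summation gives the operation result (claim 5).
    operation_result = (w_value * target_value
                        + w_count * specific_count
                        + w_ratio * ratio)
    # Add the preset constant, then normalize (claim 6). A sigmoid is assumed here.
    shifted = operation_result + PRESET_CONSTANT
    return 1.0 / (1.0 + math.exp(-shifted))
```

With these assumed parameters, a set with a target calculated value of 2.0 and no specific-type information has a shifted sum of 0 and therefore scores 0.5; sets with more abnormal information score higher.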
7. The method according to any one of claims 1 to 6, further comprising:
if newly added information to be clustered exists, obtaining a cluster set that matches the information to be clustered from the processing result of the plurality of cluster sets to be processed;
and adding the information to be clustered into the matched cluster set.
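Claim 7 does not state how a matching cluster set is found. The sketch below assumes, purely for illustration, that each piece of information is a set of tokens and that matching uses Jaccard similarity against existing members; the function names and the `min_similarity` threshold are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_to_matching_cluster(processing_result: dict, new_tokens: set,
                            min_similarity: float = 0.5):
    """Add new_tokens to the best-matching cluster set and return its id.

    processing_result maps a cluster set id to a list of token sets.
    Returns None, leaving processing_result unchanged, if nothing matches.
    """
    best_id, best_sim = None, 0.0
    for cluster_id, members in processing_result.items():
        # Similarity to a cluster set is taken as the best similarity to any member.
        sim = max((jaccard(new_tokens, m) for m in members), default=0.0)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    if best_id is not None and best_sim >= min_similarity:
        processing_result[best_id].append(new_tokens)
        return best_id
    return None
```

Because only the retained processing result is searched, the cost of placing new information depends on the number of retained cluster sets rather than on every cluster set ever produced.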
8. The method of any of claims 1 to 6, wherein generating processing results for the plurality of cluster sets from the first cluster set and the second cluster set comprises:
acquiring the identification numbers of the plurality of cluster sets;
and combining the identification numbers of the cluster sets, the first cluster set and the second cluster set into a processing result for the plurality of cluster sets.
9. The method of claim 8, wherein the method further comprises:
if a newly added cluster set exists, acquiring the identification numbers of the cluster sets from the processing result of the plurality of cluster sets to be processed, so as to obtain a plurality of identification numbers;
and generating the identification number of the newly added cluster set according to the plurality of identification numbers.
10. The method of claim 9, wherein generating the identification number of the newly added cluster set from the plurality of identification numbers comprises:
acquiring a maximum identification number from the plurality of identification numbers;
and calculating the sum of the maximum identification number and a preset value, and taking the calculated sum value as the identification number of the newly added cluster set.
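The identification number generation of claims 9 and 10 reduces to taking the maximum existing number and adding a preset value. A minimal sketch follows; the function name, the default preset value of 1 and the behaviour for an empty result (starting from 0) are assumptions.

```python
def next_cluster_id(existing_ids, preset_value: int = 1) -> int:
    """Return the identification number for a newly added cluster set."""
    if preset_value <= 0:
        raise ValueError("preset_value must be positive to keep ids unique")
    # Maximum existing id plus the preset value; an empty input starts from 0.
    return max(existing_ids, default=0) + preset_value
```

For example, `next_cluster_id([3, 7, 5])` returns 8. Deriving the new number from the maximum keeps it distinct from every identification number in the processing result, provided the preset value is positive.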
11. The method according to any one of claims 1 to 6, further comprising:
storing offline those of the other cluster sets that are not included in the second cluster set.
12. A cluster set processing apparatus, the apparatus comprising:
an acquisition unit configured to acquire a plurality of cluster sets to be processed, wherein each cluster set contains a plurality of pieces of cluster information;
a determining unit configured to determine, from the plurality of cluster sets and according to the latest update time of each cluster set, a first cluster set whose latest update time is after a predetermined point in time;
a screening unit configured to select, according to the cluster information content contained in the cluster sets other than the first cluster set among the plurality of cluster sets, the cluster sets whose cluster information content is greater than or equal to a set threshold, so as to obtain a screened second cluster set;
a generation unit configured to generate processing results for the plurality of cluster sets according to the first cluster set and the second cluster set;
wherein the screening unit comprises:
a calculating subunit configured to calculate, according to the cluster information content contained in the cluster sets other than the first cluster set among the plurality of cluster sets, the information content scores corresponding to the other cluster sets;
and a screening subunit configured to select, from the other cluster sets, the cluster sets whose information content score is greater than or equal to a preset score threshold, so as to obtain the screened second cluster set.
13. The apparatus of claim 12, wherein the calculating subunit is configured to:
acquiring the content of the specific type of cluster information contained in the other cluster sets and the ratio between the content of the specific type of cluster information contained in the other cluster sets and the content of the contained cluster information; the specific type of cluster information refers to abnormal cluster information in the cluster information;
and calculating the information content scores corresponding to the other cluster sets according to the cluster information content contained in the other cluster sets, the specific type of cluster information content contained in the other cluster sets and the ratio between the specific type of cluster information content contained in the other cluster sets and the contained cluster information content.
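The division of labour among the determining, screening and generation units recited above, together with the offline storage of claim 11, can be sketched as a single function. The `ClusterSet` fields, the function name and the representation of the processing result as a dictionary keyed by identification number are assumptions; the information content score is taken as already computed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ClusterSet:
    cluster_id: int          # identification number of the cluster set
    last_updated: datetime   # latest update time
    score: float             # precomputed information content score


def process_cluster_sets(cluster_sets, cutoff: datetime, score_threshold: float):
    """Split cluster sets into a retained processing result and an offline list."""
    # First cluster set: latest update time is after the predetermined time point.
    first = [c for c in cluster_sets if c.last_updated > cutoff]
    others = [c for c in cluster_sets if c.last_updated <= cutoff]
    # Second cluster set: older sets whose score reaches the threshold.
    second = [c for c in others if c.score >= score_threshold]
    # Remaining older sets are candidates for offline storage.
    offline = [c for c in others if c.score < score_threshold]
    # Processing result keyed by identification number.
    result = {c.cluster_id: c for c in first + second}
    return result, offline
```

Recently updated cluster sets are kept regardless of score, so the score threshold only prunes cluster sets that have not been updated since the cutoff.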
14. A computer readable medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the cluster set processing method according to any one of claims 1 to 11.
15. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the cluster set processing method according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110261725.6A CN113010759B (en) | 2021-03-10 | 2021-03-10 | Cluster set processing method and device, computer readable medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110261725.6A CN113010759B (en) | 2021-03-10 | 2021-03-10 | Cluster set processing method and device, computer readable medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113010759A (en) | 2021-06-22 |
CN113010759B (en) | 2023-10-27 |
Family
ID=76404482
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110261725.6A Active CN113010759B (en) | 2021-03-10 | 2021-03-10 | Cluster set processing method and device, computer readable medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113010759B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11861870B2 (en) * | 2021-07-23 | 2024-01-02 | The Boeing Company | Rapid object detection for vehicle situational awareness |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110209808A (en) * | 2018-08-08 | 2019-09-06 | 腾讯科技(深圳)有限公司 | A kind of event generation method and relevant apparatus based on text information |
CN112367338A (en) * | 2020-11-27 | 2021-02-12 | 腾讯科技(深圳)有限公司 | Malicious request detection method and device |
Also Published As
Publication number | Publication date |
---|---|
CN113010759A (en) | 2021-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11443212B2 (en) | Learning policy explanations | |
US11106999B2 (en) | Automatic segmentation of a collection of user profiles | |
CN111932386B (en) | User account determining method and device, information pushing method and device, and electronic equipment | |
CN111339443B (en) | User label determination method and device, computer equipment and storage medium | |
CN109471978B (en) | Electronic resource recommendation method and device | |
CN113761153B (en) | Picture-based question-answering processing method and device, readable medium and electronic equipment | |
WO2021155691A1 (en) | User portrait generating method and apparatus, storage medium, and device | |
CN110807474A (en) | Clustering method and device, storage medium and electronic equipment | |
CN112948561A (en) | Method and device for automatically expanding question-answer knowledge base | |
CN112418320A (en) | Enterprise association relation identification method and device and storage medium | |
CN115730597A (en) | Multi-level semantic intention recognition method and related equipment thereof | |
CN113010759B (en) | Cluster set processing method and device, computer readable medium and electronic equipment | |
CN113837307A (en) | Data similarity calculation method and device, readable medium and electronic equipment | |
CN110197078B (en) | Data processing method and device, computer readable medium and electronic equipment | |
CN111667018B (en) | Object clustering method and device, computer readable medium and electronic equipment | |
CN112926341A (en) | Text data processing method and device | |
CN111325578A (en) | Prediction model sample determination method, prediction model sample determination device, prediction model sample determination medium, and prediction model sample determination device | |
CN116957006A (en) | Training method, device, equipment, medium and program product of prediction model | |
CN111274818B (en) | Word vector generation method and device | |
CN114268625B (en) | Feature selection method, device, equipment and storage medium | |
CN115099875A (en) | Data classification method based on decision tree model and related equipment | |
Zarzour et al. | An efficient recommender system based on collaborative filtering recommendation and cluster ensemble | |
CN110472140B (en) | Object word recommendation method and device and electronic equipment | |
CN114461822A (en) | Resource processing method, device, equipment and storage medium | |
CN115115075A (en) | Risk identification method and device and electronic equipment |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40045926; Country of ref document: HK |
GR01 | Patent grant | |