CN112508134A - Method, device, medium and electronic equipment for measuring similarity between sets - Google Patents


Info

Publication number
CN112508134A
CN112508134A
Authority
CN
China
Prior art keywords
cluster
sample
samples
value
new
Prior art date
Legal status
Granted
Application number
CN202110138928.6A
Other languages
Chinese (zh)
Other versions
CN112508134B (en)
Inventor
李嘉晨
郭凯
Current Assignee
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Seashell Housing Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Seashell Housing Beijing Technology Co Ltd filed Critical Seashell Housing Beijing Technology Co Ltd
Priority to CN202110138928.6A
Publication of CN112508134A
Application granted
Publication of CN112508134B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, an apparatus, a medium, and an electronic device for measuring inter-set similarity are disclosed. The method comprises: obtaining a sample vector for each sample in a first set and a second set; taking all samples in the first set and the second set as the samples of a cluster to be processed; determining, according to the values of one dimension of the sample vectors, a partition value for the current round of cluster division of the cluster to be processed; performing cluster division on the samples in the cluster to be processed by using the partition value, to obtain at least one new cluster; determining a current division state according to the set to which each sample contained in each new cluster belongs; if the current division state does not match a preset division stop state, updating the cluster to be processed according to the new clusters and performing cluster division again; and if the current division state matches the preset division stop state, determining the similarity between the first set and the second set according to the number of rounds of cluster division performed so far. The method and the device make the precision of the inter-set similarity controllable.

Description

Method, device, medium and electronic equipment for measuring similarity between sets
Technical Field
The present disclosure relates to computer technologies, and in particular, to a method for measuring inter-set similarity, an apparatus for measuring inter-set similarity, a storage medium, and an electronic device.
Background
Existing methods for measuring inter-set similarity mainly fall into the following two categories:
The first is probability-based measurement. It requires determining the probability distribution of the samples in each of the two sets, and then determining the similarity between the two sets based on those probability distributions.
The second is space-based measurement. It first maps the samples of the two sets into a space, determines representative samples of the two sets in that space, uses those representative samples as calculation terms, computes distances between the calculation terms, and determines the similarity between the two sets from the computed distances.
Samples obtained in scenarios such as low-frequency interaction often have small values in some, or even all, dimensions, and the probability distribution formed by such samples often has low confidence, which harms the accuracy of the similarity between the two sets. In addition, the computational cost of the first method tends to be high when the samples in the sets have high dimensionality. Furthermore, the first method cannot control the precision of the similarity between the two sets.
The implementation complexity (such as the amount of computation) of the second method is often strongly correlated with the number of samples in the sets, so it is often a poor fit for low-precision application scenarios. The second method likewise cannot control the precision of the similarity between the two sets.
How to determine the similarity between two sets while keeping the precision of that similarity controllable is a technical problem worthy of attention.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a method for measuring inter-set similarity, an apparatus for measuring inter-set similarity, a storage medium, and an electronic device.
According to an aspect of the embodiments of the present disclosure, there is provided a method for measuring inter-set similarity, the method including: obtaining a sample vector for each sample in the first set and the second set; taking all samples in the first set and the second set as the samples of a cluster to be processed; determining, according to the values of one dimension of the sample vectors, the partition value for the current round of cluster division of the cluster to be processed; performing cluster division on the samples in the cluster to be processed by using the partition value, to obtain at least one new cluster; determining a current division state according to the set to which each sample contained in each new cluster belongs; if the current division state does not match a preset division stop state, updating the cluster to be processed according to the new clusters and returning to the step of determining, according to the values of one dimension of the sample vectors, the partition value for the current round of cluster division of the cluster to be processed; and if the current division state matches the preset division stop state, determining the similarity between the first set and the second set according to the number of rounds of cluster division performed so far.
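The iterative procedure above can be sketched in Python. This is a minimal illustration, not the claimed implementation: the function and variable names, the purity-based stop state, and the `<=` comparison are assumptions of this sketch, and the disclosure allows other stop states and magnitude relations.

```python
import random

def set_similarity(set_a, set_b, max_rounds=1000):
    """Estimate the similarity of two sets of equal-length sample vectors.

    Pool the samples, then repeatedly split every still-mixed cluster at a
    randomly chosen partition value of a randomly chosen dimension; stop
    when no mixed cluster remains (or after max_rounds as a guard) and
    return the reciprocal of the number of division rounds performed.
    """
    # Tag each sample with the set it came from, then pool everything
    # into a single initial cluster to be processed.
    pooled = [(tuple(v), 0) for v in set_a] + [(tuple(v), 1) for v in set_b]
    clusters = [pooled]
    dims = len(pooled[0][0])
    rounds = 0
    while clusters and rounds < max_rounds:
        rounds += 1
        # Partition value for this round: a random point inside the value
        # range of a randomly selected dimension.
        d = random.randrange(dims)
        column = [vec[d] for cluster in clusters for vec, _ in cluster]
        partition = random.uniform(min(column), max(column))
        next_clusters = []
        for cluster in clusters:
            low = [s for s in cluster if s[0][d] <= partition]
            high = [s for s in cluster if s[0][d] > partition]
            for new in (low, high):
                # Only "non-single" clusters (samples from both sets)
                # are carried into the next round.
                if new and len({sid for _, sid in new}) == 2:
                    next_clusters.append(new)
        clusters = next_clusters
    return 1.0 / rounds if rounds else 1.0
```

Note that, per the disclosure, more rounds of division yield a smaller reciprocal, and the `max_rounds` guard here plays the role of the preset maximum number of cluster-division rounds.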
In an embodiment of the present disclosure, the obtaining a sample vector for each sample in the first set and the second set includes: performing dimension compression on at least some dimensions of each sample in the first set and the second set, to obtain a multi-dimensional compressed vector for each sample; and, if a sample contains at least one dimension on which compression was not performed, splicing the multi-dimensional compressed vector of each sample with the uncompressed dimensions of that sample, to obtain the sample vector of each sample in the first set and in the second set.
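As an illustration of this embodiment, the following sketch compresses a chosen group of dimensions and splices the result with the untouched dimensions. The text does not name a concrete compression method, so a plain mean is used here as a stand-in; the function name and signature are likewise illustrative.

```python
def compress_and_splice(samples, compress_dims, keep_dims):
    """Reduce the dimensions listed in compress_dims to one compressed
    value per sample (a mean here, as a stand-in for any real compression
    method) and splice it with the dimensions left uncompressed."""
    spliced = []
    for vec in samples:
        compressed = sum(vec[d] for d in compress_dims) / len(compress_dims)
        spliced.append([compressed] + [vec[d] for d in keep_dims])
    return spliced
```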
In an embodiment of the present disclosure, the obtaining a sample vector for each sample in the first set and the second set includes: for any enumeration-typed dimension of the samples in the first set and the second set, converting that dimension of each sample into a multi-dimensional one-hot vector, and setting the values of each sample's one-hot vector according to the enumeration value of that dimension in the sample.
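A minimal sketch of the enumeration-to-one-hot conversion described above. The helper name, the explicit category list, and the encoding order are assumptions; the text fixes none of them.

```python
def enum_to_one_hot(samples, dim, categories):
    """Replace the enumeration-typed value at index `dim` of every sample
    vector with a multi-dimensional one-hot sub-vector, set according to
    the sample's enumeration value."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for vec in samples:
        one_hot = [0.0] * len(categories)
        one_hot[index[vec[dim]]] = 1.0
        # Splice: dimensions before, the one-hot block, dimensions after.
        encoded.append(vec[:dim] + one_hot + vec[dim + 1:])
    return encoded
```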
In an embodiment of the present disclosure, the determining, according to the values of one dimension of the sample vectors, the partition value for the current round of cluster division includes: randomly selecting one dimension of the sample vectors, determining the value range spanned by the values of the selected dimension, randomly selecting one value from that range, and using the selected value as the partition value for the current round of cluster division; or randomly selecting one dimension of the sample vectors, randomly selecting one value from among all observed values of that dimension, and using the selected value as the partition value for the current round of cluster division.
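Both variants of partition-value selection can be sketched as follows; the function name and the `mode` switch are illustrative, not part of the disclosure.

```python
import random

def choose_partition_value(sample_vectors, mode="range"):
    """Pick the partition value for one round of cluster division.

    'range'  - a random value drawn from [min, max] of a random dimension
               (the first variant in the text);
    'sample' - one of the actually observed values of a random dimension
               (the second variant).
    Returns (dimension index, partition value)."""
    d = random.randrange(len(sample_vectors[0]))
    column = [v[d] for v in sample_vectors]
    if mode == "range":
        value = random.uniform(min(column), max(column))
    else:
        value = random.choice(column)
    return d, value
```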
In an embodiment of the present disclosure, the performing cluster division on the samples in the cluster to be processed by using the partition value, to obtain at least one new cluster, includes: for any sample in any cluster to be processed, if the magnitude relation between the value of the corresponding dimension of the sample and the partition value satisfies a preset magnitude relation, assigning the sample to one new cluster under that cluster to be processed; otherwise, assigning the sample to another new cluster under that cluster to be processed.
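A sketch of one cluster's division using the partition value, with `<=` standing in for the preset magnitude relation (the text does not fix which relation is used):

```python
def split_cluster(cluster, dim, partition_value):
    """Split one pending cluster into (at most) two new clusters: samples
    whose value at `dim` satisfies the relation go to one new cluster,
    the rest to the other. Empty halves are dropped."""
    low = [v for v in cluster if v[dim] <= partition_value]
    high = [v for v in cluster if v[dim] > partition_value]
    return [c for c in (low, high) if c]
```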
In an embodiment of the present disclosure, the determining a current division state according to the set to which each sample contained in each new cluster belongs includes at least one of: when all samples contained in one of the new clusters belong to the same set, determining that a single-class cluster exists in the current round's division result and incrementing a counter of rounds that produced a single-class cluster, to obtain a first cumulative count; when all samples contained in one of the new clusters belong to the same set, determining that single-class clusters exist in the current round's division result and accumulating the number of single-class clusters, to obtain a second cumulative count; when all samples contained in one of the new clusters belong to the same set, determining that single-class clusters exist in the current round's division result and obtaining the single-class-cluster ratio as the number of single-class clusters divided by the number of new clusters; and determining the sample bias value of each new cluster according to the sets to which the samples contained in that new cluster belong; wherein at least one of the first cumulative count, the second cumulative count, the single-class-cluster ratio, and the sample bias values is used to form the current division state.
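The per-round statistics described above might be computed as in the following sketch. The exact sample-bias formula is left open by the text, so the normalized count difference used here is only one plausible choice; the data layout (clusters as lists of `(vector, set_id)` pairs) is likewise an assumption.

```python
def division_state(new_clusters, origins):
    """Summarize one round of cluster division.

    Returns (single-class cluster count, single-class-cluster ratio,
    per-cluster sample bias values), where the bias of a cluster is the
    normalized difference between how many of its samples come from each
    of the two sets (an illustrative definition)."""
    single = 0
    biases = []
    for cluster in new_clusters:
        ids = [sid for _, sid in cluster]
        n_a = ids.count(origins[0])
        n_b = ids.count(origins[1])
        if n_a == 0 or n_b == 0:
            single += 1  # all samples belong to the same set
        biases.append(abs(n_a - n_b) / len(ids))
    return single, single / len(new_clusters), biases
```

These quantities (together with a running round counter) are what get compared against the preset division stop state.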
In an embodiment of the present disclosure, the determining a current division state according to the set to which each sample contained in each new cluster belongs further includes: counting the rounds of cluster division performed so far, to obtain a third cumulative count; wherein the current division state comprises the third cumulative count.
In an embodiment of the present disclosure, the current division state not matching the preset division stop state includes at least one of: the magnitude relation between the first cumulative count and a first preset cumulative count does not satisfy a first preset magnitude relation, and the third cumulative count has not reached a preset maximum number of cluster-division rounds; the magnitude relation between the second cumulative count and a second preset cumulative count does not satisfy a second preset magnitude relation, and the third cumulative count has not reached the preset maximum number of cluster-division rounds; the magnitude relation between the single-class-cluster ratio and a preset single-class-cluster ratio does not satisfy a third preset magnitude relation, and the third cumulative count has not reached the preset maximum number of cluster-division rounds; and the magnitude relation between the average of the sample bias values and a preset bias value does not satisfy a fourth preset magnitude relation, and the third cumulative count has not reached the preset maximum number of cluster-division rounds.
In an embodiment of the present disclosure, the updating the cluster to be processed according to the new clusters includes: determining the non-single-class clusters among all new clusters obtained in the current round of cluster division, and taking each non-single-class cluster as a cluster to be processed; wherein a non-single-class cluster is a new cluster that contains both samples from the first set and samples from the second set.
In an embodiment of the present disclosure, the determining a similarity between the first set and the second set according to the number of rounds of cluster division performed so far includes: taking the reciprocal of the number of rounds of cluster division performed so far as the similarity between the first set and the second set.
According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for measuring inter-set similarity, including: a sample vector obtaining module, configured to obtain a sample vector for each sample in the first set and the second set; a to-be-processed-cluster determining module, configured to take all samples in the first set and the second set as the samples of a cluster to be processed; a partition value determining module, configured to determine, according to the values of one dimension of the sample vectors, the partition value for the current round of cluster division of the cluster to be processed; a cluster division processing module, configured to perform cluster division on the samples in the cluster to be processed by using the partition value, to obtain at least one new cluster; a state determining module, configured to determine a current division state according to the set to which each sample contained in each new cluster belongs; a loop control module, configured to, if the current division state does not match a preset division stop state, update the cluster to be processed according to the new clusters and trigger the partition value determining module to again determine, according to the values of one dimension of the sample vectors, the partition value for cluster division of the cluster to be processed; and a similarity determining module, configured to, if the current division state matches the preset division stop state, determine the similarity between the first set and the second set according to the number of rounds of cluster division performed so far.
In an embodiment of the present disclosure, the sample vector obtaining module includes: a first sub-module, configured to perform dimension compression on at least some dimensions of each sample in the first set and the second set to obtain a multi-dimensional compressed vector for each sample, and, if a sample contains at least one dimension on which compression was not performed, to splice the multi-dimensional compressed vector of each sample with the uncompressed dimensions of that sample, to obtain the sample vector of each sample in the first set and in the second set.
In an embodiment of the present disclosure, the sample vector obtaining module includes: a second sub-module, configured to, for any enumeration-typed dimension of the samples in the first set and the second set, convert that dimension of each sample into a multi-dimensional one-hot vector, and set the values of each sample's one-hot vector according to the enumeration value of that dimension in the sample.
In an embodiment of the present disclosure, the partition value determining module includes: a third sub-module, configured to randomly select one dimension of the sample vectors, determine the value range spanned by the values of the selected dimension, randomly select one value from that range, and use the selected value as the partition value for the current round of cluster division; or a fourth sub-module, configured to randomly select one dimension of the sample vectors, randomly select one value from among all observed values of that dimension, and use the selected value as the partition value for the current round of cluster division.
In an embodiment of the present disclosure, the cluster division processing module is further configured to: for any sample in any cluster to be processed, if the magnitude relation between the value of the corresponding dimension of the sample and the partition value satisfies a preset magnitude relation, assign the sample to one new cluster under that cluster to be processed; otherwise, assign the sample to another new cluster under that cluster to be processed.
In an embodiment of the present disclosure, the state determining module includes at least one of: a fifth sub-module, configured to, when all samples contained in one of the new clusters belong to the same set, determine that a single-class cluster exists in the current round's division result and increment a counter of rounds that produced a single-class cluster, to obtain a first cumulative count; a sixth sub-module, configured to, when all samples contained in one of the new clusters belong to the same set, determine that single-class clusters exist in the current round's division result and accumulate the number of single-class clusters, to obtain a second cumulative count; a seventh sub-module, configured to, when all samples contained in one of the new clusters belong to the same set, determine that single-class clusters exist in the current round's division result and obtain the single-class-cluster ratio as the number of single-class clusters divided by the number of new clusters; and an eighth sub-module, configured to determine the sample bias value of each new cluster according to the sets to which the samples contained in that new cluster belong; wherein at least one of the first cumulative count, the second cumulative count, the single-class-cluster ratio, and the sample bias values is used to form the current division state.
In an embodiment of the present disclosure, the state determining module further includes: a ninth sub-module, configured to count the rounds of cluster division performed so far, to obtain a third cumulative count; wherein the current division state comprises the third cumulative count.
In an embodiment of the present disclosure, the current division state not matching the preset division stop state includes at least one of: the magnitude relation between the first cumulative count and a first preset cumulative count does not satisfy a first preset magnitude relation, and the third cumulative count has not reached a preset maximum number of cluster-division rounds; the magnitude relation between the second cumulative count and a second preset cumulative count does not satisfy a second preset magnitude relation, and the third cumulative count has not reached the preset maximum number of cluster-division rounds; the magnitude relation between the single-class-cluster ratio and a preset single-class-cluster ratio does not satisfy a third preset magnitude relation, and the third cumulative count has not reached the preset maximum number of cluster-division rounds; and the magnitude relation between the average of the sample bias values and a preset bias value does not satisfy a fourth preset magnitude relation, and the third cumulative count has not reached the preset maximum number of cluster-division rounds.
In an embodiment of the present disclosure, the loop control module is further configured to: determine the non-single-class clusters among all new clusters obtained in the current round of cluster division, and take each non-single-class cluster as a cluster to be processed; wherein a non-single-class cluster is a new cluster that contains both samples from the first set and samples from the second set.
In an embodiment of the present disclosure, the similarity determining module is further configured to: take the reciprocal of the number of rounds of cluster division performed so far as the similarity between the first set and the second set.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the method for measuring similarity between sets.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is used for reading the executable instruction from the memory and executing the instruction to realize the method for measuring the similarity among the sets.
Based on the method and apparatus for measuring inter-set similarity provided by the foregoing embodiments of the present disclosure: because the present disclosure performs cluster division on all samples of the first set and the second set by using the values of one dimension of the sample vectors, the more similar the samples of the first set and the second set are, the more evenly each new cluster obtained by cluster division mixes samples from the two sets; and because the number of samples contained in the new clusters produced by each round of cluster division gradually decreases, the probability that a new cluster contains samples from only one set grows higher and higher. The present disclosure therefore forms the current division state from the sets to which the samples contained in each new cluster belong and, by setting a preset division stop state, can control the number of rounds of cluster division that are repeatedly performed. Since the similarity between the first set and the second set is derived from the number of rounds of cluster division, the present disclosure can control that number through the preset division stop state, thereby controlling the precision with which inter-set similarity is measured. In addition, because the computing resources consumed by processes such as cluster division are very limited, the similarity between sets can be measured at low cost while keeping its precision controllable. The technical solution provided by the present disclosure is therefore conducive to improving the flexibility of measuring inter-set similarity and to reducing its implementation cost, and thus to improving its usability.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of one embodiment of a method of measuring inter-set similarity according to the present disclosure;
FIGS. 2A-2C are schematic diagrams of an embodiment of measuring inter-set similarity according to the present disclosure;
FIG. 3 is a schematic diagram illustrating an embodiment of an apparatus for measuring similarity between sets according to the present disclosure;
fig. 4 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning, nor is the necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B both exist, or B exists alone. The character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with an electronic device such as a terminal device, computer system, or server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In the process of implementing the present disclosure, the inventors found that some application scenarios require obtaining the similarity between different sets very accurately, while others only require comparing the similarities between different sets; in the latter case, obtaining the similarities very accurately is unnecessary.
Because existing technical schemes for measuring inter-set similarity cannot control the precision of the similarity between different sets, and high precision comes with a high amount of computation, those schemes are a poor fit for some application scenarios.
If the precision of the similarity can be controlled while measuring inter-set similarity, the technique can be better applied both to application scenarios that require obtaining the similarity between different sets very accurately and to application scenarios that only require comparing the similarities between different sets.
Exemplary overview
An example of an application scenario of the technique for measuring inter-set similarity provided by the present disclosure is as follows:
suppose there are n users in the system, namely user 1, user 2, …, and user n. The sample vectors of the user samples of these n users each comprise m dimensional vectors, namely vector 1, vector 2, …, and vector m.
Suppose the above n users belong to three different user sets; for example, user 1, user 2, …, and user n1 belong to a first user set, user n1+1, user n1+2, …, and user n2 belong to a second user set, and user n2+1, user n2+2, …, and user n belong to a third user set.
In an application scenario where it must be determined whether the first user set is more similar to the second user set or to the third user set, a preset division stop state can be set so that, based on the m dimensional vectors contained in the sample vectors of the n users, the technical scheme provided by the present disclosure obtains the similarity between the first and second user sets and the similarity between the first and third user sets relatively roughly; by comparing these two similarities, it can then be determined rapidly whether the first user set is more similar to the second user set or to the third user set.
In an application scenario where the similarities between the three user sets need to be obtained accurately, the preset division stop state is set so that, based on the m dimensional vectors contained in the sample vectors of the n users, the technical scheme provided by the present disclosure obtains the similarity between the first and second user sets, between the first and third user sets, and between the second and third user sets with the corresponding accuracy, so that the precision of the inter-set similarity meets the requirements of the actual application scenario as far as possible.
Exemplary method
FIG. 1 is a flow chart of one embodiment of a method for measuring inter-set similarity according to the present disclosure.
The method of the embodiment shown in fig. 1 comprises the steps of: s100, S101, S102, S103, S104, S105, and S106. The above steps will be described below.
S100, obtaining a sample vector of each sample in the first set and the second set.
A sample in the present disclosure may be considered a record that describes a target object from multiple dimensions. The target object may be a user, a house source, a commodity, or the like. The sample vector of a sample in the present disclosure typically comprises multiple dimensional vectors. The present disclosure may obtain the sample vector of each sample in the first set and the sample vector of each sample in the second set by reading information from a storage location such as a data warehouse. In the present disclosure, the sample vectors of the samples in the first and second sets may be the sample vectors of the original samples, or sample vectors obtained by preprocessing the sample vectors of the original samples.
Each sample in the present disclosure may have a respective set identification to indicate whether the sample belongs to a sample of the first set or a sample of the second set.
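As an illustration of S100 and the set identifiers just described, samples might be represented as follows; the dict layout and field names are assumptions made for this sketch, not part of the disclosure.

```python
# Each sample is a record describing a target object (e.g., a house
# source) from several dimensions, tagged with the identifier of the
# set it belongs to.  Field names are illustrative assumptions.
samples = [
    {"set_id": 1, "vector": [3, 120.0, 92.0]},   # a sample of the first set
    {"set_id": 2, "vector": [2, 150.6, 75.0]},   # a sample of the second set
]
```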
S101, taking all samples in the first set and the second set as the samples in the cluster to be processed.
The cluster to be processed in the present disclosure may also be referred to as the current cluster to be divided. The cluster to be processed formed by all samples in the first set and all samples in the second set may be referred to as the initial cluster to be processed. In the subsequent flow, new clusters to be processed are formed, and each newly formed cluster to be processed usually contains only part of the samples of the cluster to be processed from which it was derived.
S102, determining the partition value for cluster division of the cluster to be processed according to the value of a one-dimensional vector of the sample vectors.
All dimensional vectors in the sample vectors of the present disclosure are typically numerical vectors. The present disclosure may select a one-dimensional vector from the sample vectors according to a preset selection rule, and determine the partition value for this cluster division of the cluster to be processed according to the values of the selected vector across all samples in the cluster to be processed. That is, the present disclosure may obtain a value from the values of the selected vector in all samples of the cluster to be processed, and use that value as the partition value for this cluster division. The preset selection rule can be set according to actual requirements. In the present disclosure, a value of a one-dimensional vector of a sample vector may be one of the values in the value interval formed by all values belonging to that one-dimensional vector.
The partition value in the present disclosure is usually a specific value, and the value is used to perform cluster division on all the clusters to be processed, so that each cluster to be processed is divided into one or two new clusters. The cluster division in the present disclosure may also be considered as grouping all samples in the cluster to be processed, and the grouping is based on the size of the partition value and the value of the vector of the corresponding dimension in all samples in the cluster to be processed.
S103, cluster division processing is carried out on samples in the cluster to be processed by utilizing the partition value, and at least one new cluster is obtained.
Any new cluster in the present disclosure is typically a subset of the corresponding cluster to be processed. For example, all samples contained in a new cluster are typically part of the samples in the corresponding cluster to be processed. As another example, all samples contained in a new cluster may be all samples in the corresponding cluster to be processed.
For any cluster to be processed, the present disclosure may divide the cluster to be processed into two parts by comparing the partition value with the values of the vectors of the corresponding dimensions in all samples included in the cluster to be processed, and each part is a new cluster. Of course, in the present disclosure, by comparing the partition value with the values of the vectors of the corresponding dimensions in all the samples included in the cluster to be processed, there is also a case where all the samples in the cluster to be processed belong to the same new cluster, that is, the cluster to be processed is the new cluster obtained after the cluster division processing is executed this time.
The result of the cluster division processing in the present disclosure can be recorded by labeling the samples accordingly; for example, for any sample that belongs to a new cluster, the cluster identifier of that new cluster can be set for the sample. In addition, the present disclosure can also record, for each sample, the number of cluster division processes the sample has undergone.
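A minimal sketch of the division-and-labeling bookkeeping described in S103 might look as follows; the sample representation, function name, and label fields are assumptions, not the patent's actual implementation.

```python
def divide_cluster(cluster, dim, partition_value):
    """S103: split one cluster to be processed into at most two new
    clusters by comparing each sample's value on dimension `dim`
    with the partition value."""
    low = [s for s in cluster if s["vector"][dim] <= partition_value]
    high = [s for s in cluster if s["vector"][dim] > partition_value]
    new_clusters = [part for part in (low, high) if part]
    # Record the cluster identifier and the number of divisions each
    # sample has undergone (the labeling described above).
    for cid, new_cluster in enumerate(new_clusters):
        for sample in new_cluster:
            sample["cluster_id"] = cid
            sample["divisions"] = sample.get("divisions", 0) + 1
    return new_clusters

cluster = [
    {"set_id": 1, "vector": [1.0, 3.0]},
    {"set_id": 1, "vector": [2.0, 5.0]},
    {"set_id": 2, "vector": [4.0, 1.0]},
]
new_clusters = divide_cluster(cluster, dim=0, partition_value=2.5)
# → two new clusters: the first two samples, and the third sample
```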
S104, determining the current division state according to the set to which each sample contained in each new cluster belongs.
The current division state in the present disclosure may refer to information indicating a case where cluster division processing has been performed for the clusters to be processed formed by the first set and the second set. The present disclosure may represent the current partitioning state from at least one index, and in one example, the present disclosure generally uses two or more indices to represent the current partitioning state.
For any new cluster in the present disclosure, if all samples contained in the new cluster belong to the first set or all belong to the second set, the present disclosure may consider the new cluster a single-class cluster; if part of the samples contained in the new cluster belong to the first set and another part belong to the second set, the present disclosure may consider the new cluster a non-single-class cluster. The present disclosure may perform statistics in various forms on the sets to which the samples contained in each new cluster belong, and obtain the current division state after this cluster division processing according to the statistical result. The present disclosure may also set, for all samples in each new cluster, an identifier indicating whether the sample belongs to a single-class cluster or to a non-single-class cluster.
S105, if the determined current division state does not conform to the preset division stop state, updating the cluster to be processed according to each new cluster, and returning to step S102.
The preset division stop state in the present disclosure may refer to information for controlling whether to continue the cluster division processing, and may be regarded as information for controlling the precision of the measured inter-set similarity. The present disclosure may set the preset division stop state according to the precision that the actual application scenario requires of the inter-set similarity, so that the precision of the finally obtained similarity between the first set and the second set meets the requirement of the actual application scenario as far as possible.
In the present disclosure, the current division state not conforming to the preset division stop state may be understood as the condition for stopping cluster division not yet being satisfied. When the clusters to be processed are updated with the new clusters, all new clusters may each be taken as clusters to be processed, or only part of the new clusters may be taken as clusters to be processed.
S106, if the determined current division state conforms to the preset division stop state, determining the similarity of the first set and the second set according to the number of times cluster division processing has currently been performed.
In the present disclosure, the current division state conforming to the preset division stop state may be understood as the condition for stopping cluster division being satisfied. The present disclosure may perform corresponding arithmetic processing on the number of times cluster division processing has currently been performed, and take the result of the arithmetic processing as the similarity of the first set and the second set.
Cluster division processing is performed on all samples in the first set and all samples in the second set using the value of a one-dimensional vector of the sample vectors. The more similar the samples of the two sets are, the more evenly samples from the two sets are mixed in each new cluster obtained by a division; and as each division reduces the number of samples per new cluster, the probability that a new cluster contains samples from only one set keeps rising. The current division state is formed from the sets to which the samples of each new cluster belong, so the number of times cluster division processing is repeated can be controlled by setting the preset division stop state. Since the similarity of the first set and the second set is derived from the number of cluster division processes performed, the present disclosure can control that number, and thereby the precision of the measured inter-set similarity, through the preset division stop state. In addition, because the computing resources consumed by processes such as cluster division are very limited, inter-set similarity can be measured at low cost while its precision remains controllable. The technical scheme provided by the present disclosure therefore helps improve the flexibility of measuring inter-set similarity and reduce its implementation cost, which in turn helps improve the usability of inter-set similarity measurement.
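The whole loop S100 to S106 can be sketched as below. This is only an interpretation under stated assumptions: the stop state is taken to be "no non-single-class cluster remains", and the round count is mapped to a similarity score by simple normalization; the disclosure leaves both configurable.

```python
import random

def measure_similarity(set_a, set_b, max_rounds=64, seed=0):
    """Sketch of S100-S106: repeatedly split the pending clusters with a
    randomly chosen dimension and partition value until every cluster is
    single-class (the assumed stop state), then derive a similarity
    score from the number of division rounds performed."""
    rng = random.Random(seed)
    # S100/S101: tag samples with their set and form the initial cluster.
    samples = [(vec, 0) for vec in set_a] + [(vec, 1) for vec in set_b]
    dims = len(samples[0][0])
    pending = [samples]
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        dim = rng.randrange(dims)                 # S102: random dimension
        values = [vec[dim] for cluster in pending for vec, _ in cluster]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue                              # nothing left to split on
        cut = rng.uniform(lo, hi)                 # S102: random partition value
        next_pending = []
        for cluster in pending:                   # S103: divide each cluster
            for part in ([s for s in cluster if s[0][dim] <= cut],
                         [s for s in cluster if s[0][dim] > cut]):
                # S104/S105: non-single-class clusters stay pending.
                if part and len({label for _, label in part}) > 1:
                    next_pending.append(part)
        pending = next_pending
    # S106: more rounds needed to separate the sets -> higher similarity.
    return rounds / max_rounds

sim_far = measure_similarity([[0.0], [0.1]], [[10.0], [10.1]])
sim_near = measure_similarity([[0.0], [1.0]], [[0.5], [1.5]])
```

Intuitively, the more the two sets overlap, the more rounds of random cuts are needed before every remaining cluster is single-class, so a larger normalized round count indicates higher similarity.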
In an optional example, all samples taken from the first set and the second set by the present disclosure may be samples after performing the dimension compression process. That is, at least part of the dimensional vectors in all samples finally acquired from the first set and the second set by the present disclosure are vectors obtained after the dimension compression processing.
In one example, part of the dimensional vectors of the samples in the first and second sets are high-dimensional sparse vectors (e.g., they meet a preset high-dimensional sparse vector condition). The present disclosure may perform dimension compression processing (i.e., low-dimensional dense vector transformation) on those high-dimensional sparse vectors in each sample, and concatenate the resulting low-dimensional dense vectors with the dimensional vectors of the corresponding sample that were not compressed (e.g., those not meeting the preset high-dimensional sparse vector condition), thereby obtaining the sample vector of each sample obtained from the first set and from the second set. In a more specific example, the sample vectors of the samples in both sets are a-dimensional. The first to a1-th dimensional vectors (a1 is an integer less than a) meet the preset high-dimensional sparse vector condition and are therefore treated as high-dimensional sparse vectors, while the a1+1-th to a-th dimensional vectors do not meet the condition and are not treated as high-dimensional sparse vectors (e.g., vectors representing behavior-quantity characteristics of a user). For any sample, the present disclosure may perform dimension compression on the first to a1-th dimensional vectors of the sample to obtain a2 dimensional vectors (a2 is smaller than a1), and splice these a2 dimensional vectors with the a1+1-th to a-th dimensional vectors of the sample to obtain an (a2 + a - a1)-dimensional vector; this vector can be used as the sample vector of the sample finally obtained from the set by the present disclosure.
In another example, all the dimensional vectors in the samples in the first set and the second set belong to high-dimensional sparse vectors (e.g., all the dimensional vectors conform to a preset high-dimensional sparse vector condition), and the present disclosure may perform a dimensional compression process on all the dimensional vectors in each sample, and use a low-dimensional dense vector of all the samples obtained after the dimensional compression process as a sample vector of each sample finally obtained from the set by the present disclosure. More specifically, the sample vectors of the samples in the first set and the second set are both a-dimensional vectors, and the a-dimensional vectors both meet the preset high-dimensional sparse vector condition, and thus are both considered as high-dimensional sparse vectors, and for any sample, the present disclosure may perform a dimension compression process on the a-dimensional vectors in the samples to obtain a3 (a 3 is smaller than a) dimensional vectors, and the a3 dimensional vectors may be used as sample vectors of a sample finally obtained from the sets by the present disclosure.
Optionally, the present disclosure may implement the dimension compression processing by using techniques such as a neural network. For example, the present disclosure may provide a high-dimensional sparse vector of a sample to an encoder and obtain the dimension compression result, i.e., a low-dimensional dense vector, from the output of the encoder. As another example, the present disclosure may provide a high-dimensional sparse vector of a sample to an MLP (Multi-Layer Perceptron) and obtain the dimension compression result from the output of a hidden layer of the MLP. The present disclosure does not limit the specific implementation of the dimension compression processing.
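Since the disclosure does not fix a particular encoder, the sketch below uses a fixed random projection as a dependency-free stand-in for the trained encoder or MLP hidden layer; the function names and dimensions are assumptions.

```python
import random

def make_projector(high_dim, low_dim, seed=0):
    """Build a fixed random projection as a stand-in for the trained
    encoder / MLP hidden layer mentioned above (an assumption: any map
    from a1 sparse dims to a2 dense dims fills the role described)."""
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(high_dim)]
              for _ in range(low_dim)]

    def project(sparse_vec):
        return [sum(w * x for w, x in zip(row, sparse_vec))
                for row in matrix]

    return project

# Compress the first a1 = 6 sparse dimensions to a2 = 2 dense ones,
# then splice with the remaining untouched dimensions, as described.
project = make_projector(high_dim=6, low_dim=2)
sample = [0, 0, 1, 0, 0, 0, 3.0, 120.0]        # 6 sparse dims + 2 dense dims
compressed = project(sample[:6]) + sample[6:]  # length a2 + (a - a1) = 4
```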
Optionally, the present disclosure may update the samples in the first set and the samples in the second set by using the samples obtained after performing the dimension compression and the stitching, that is, the updated samples in the first set and the second set are samples that may be directly used in the subsequent process.
By performing dimension compression on at least part of the dimensional vectors in the samples, the present disclosure helps avoid the presence of high-dimensional sparse vectors in the samples and their influence on the subsequent processing flow, which helps improve the efficiency of measuring inter-set similarity.
In an alternative example, part of the dimensional vectors of each sample obtained from the first and second sets in the present disclosure may be type-converted vectors. In one example, the vectors of all dimensions of each sample obtained from the first and second sets are vectors of a preset type; if the sample vectors of the samples in the first and second sets contain vectors of a non-preset type, such vectors should be converted into vectors of the preset type. More specifically, the non-preset type may include the enumeration type, and the preset type may include the one-hot type, which may also be considered a numerical type; in one example, the present disclosure may convert each enumeration-type vector in a sample into a one-hot-type vector.
Optionally, as an example of numerical-type vectors in a sample vector: the sample vector of a house-source sample comprises at least the following three dimensional vectors: the number of rooms of the house source, the price of the house source, and the area of the house source. The number of rooms may be 1, 2, 3, 4, 5, 6, 7, etc.; the price may be 100, 110, 120, 150, 170, 200, 250, 300, 400, etc. (in units of ten thousand); and the area may be 56, 60, 75, 80, 92, 100, etc. (in square meters).
Optionally, for any one of the dimension enumeration type vectors in the samples in the first set and the second set, the present disclosure may convert the dimension enumeration type vector in each sample in the first set and the second set into a multidimensional one-hot type vector, and set a numerical value of the multidimensional one-hot type vector of each sample according to an enumeration value of the dimension enumeration type vector in each sample in the first set and the second set. For example, assuming that the ith (i is an integer greater than 0) dimension vector in a sample is an enumerated type vector, and the number of enumerated values of the ith dimension vector is k (k is an integer greater than 1), the present disclosure may convert the ith dimension vector in each sample into a k-dimensional one-hot type vector, and for the k-dimensional one-hot type vector in any sample, the present disclosure sets the value of one dimension vector in the k-dimensional one-hot type vector to a first value (e.g., 1) and sets the values of the other dimension vectors in the k-dimensional one-hot type vector to a second value (e.g., 0) according to the enumerated value of the ith dimension vector in the sample.
In a specific example, assume that one of the dimensional vectors in the sample vector of a house-source sample is the house-source ring-road vector, which is an enumerated-type vector whose enumerated values may include six values: within the first ring, between the first and second rings, between the second and third rings, between the third and fourth rings, between the fourth and fifth rings, and between the fifth and sixth rings. The present disclosure may then convert the ring-road vector into a six-dimensional one-hot-type vector whose dimensions indicate, respectively, whether the house source lies in each of the six ring regions. For any sample whose ring-road enumerated value is "within the first ring", when replacing the ring-road vector of the sample with the six-dimensional one-hot vector, the present disclosure may set the value of the "within the first ring" dimension to 1 and the values of the other five dimensions to 0.
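The ring-road example above can be reproduced with a small helper; the English ring labels are paraphrases of the enumerated values.

```python
def enum_to_one_hot(value, enum_values):
    """Convert one enumerated-type component into a k-dimensional
    one-hot vector: the dimension matching the enumerated value is set
    to the first value 1, all others to the second value 0."""
    return [1 if value == v else 0 for v in enum_values]

# The six ring-road positions used in the example above (paraphrased).
rings = ["within ring 1", "ring 1-2", "ring 2-3",
         "ring 3-4", "ring 4-5", "ring 5-6"]
one_hot = enum_to_one_hot("within ring 1", rings)
# → [1, 0, 0, 0, 0, 0]
```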
Optionally, the present disclosure may convert all enumeration-type vectors in the sample vectors of all samples in the first and second sets into one-hot-type vectors, then perform dimension compression processing on the high-dimensional sparse vectors in all samples, and update the samples in the first set and the second set with the samples obtained after the dimension compression and splicing processing; that is, the updated samples in the first and second sets are the samples that can be directly used in the subsequent flow.
By performing type conversion on the enumeration-type vectors in the samples, the present disclosure simplifies the subsequent processing flow, which helps improve the efficiency of measuring inter-set similarity.
In an optional example, the present disclosure may perform cluster division processing at least once. In each cluster division processing, the present disclosure may select a one-dimensional vector from the sample vectors according to a preset selection manner, obtain a value based on the values of the selected one-dimensional vector according to another preset selection manner, and use that value as the partition value for this cluster division of the cluster to be processed.
Optionally, for each cluster division processing, the present disclosure may randomly select a one-dimensional vector from the sample vectors, form a value range from the values of the selected vector, and then randomly select a value from that range as the partition value for this cluster division of the cluster to be processed. For example, assume that in one cluster division processing, the randomly selected one-dimensional vector is the j-th dimensional vector of the sample vectors, the minimum value of the j-th dimensional vector across the sample vectors of all samples in the first and second sets is j1, and the maximum value is j2; the value range formed by the values of the j-th dimensional vector may then be represented as (j1, j2), and the present disclosure may randomly select a value from (j1, j2) as the partition value for this cluster division of the cluster to be processed. For example, the randomly selected vector is the number of rooms of the house source, the value range of the vector is (1, 6), and the partition value randomly selected from the range is 2.5. For another example, the randomly selected vector is the price of the house source, the value range of the vector is (50, 200), and the partition value randomly selected from the range is 150.6.
Optionally, for each cluster division processing, the present disclosure may randomly select a one-dimensional vector from the sample vectors, and randomly select one of all the values taken by the selected vector as the partition value for this cluster division of the cluster to be processed. For example, assume that in one cluster division processing, the randomly selected one-dimensional vector is the j-th dimensional vector of the sample vectors, and the values of the j-th dimensional vector in the sample vectors of all z samples in the first and second sets are j1, j2, j3, …, and jz; the present disclosure may then randomly select one of these z values as the partition value for this cluster division of the cluster to be processed. For example, in one cluster division processing, the randomly selected vector is the number of rooms of the house source, the value set of the vector is {1, 2, 3, 4, 5, 6}, and the partition value randomly selected from the set is 3. For another example, in one cluster division processing, the randomly selected vector is the price of the house source, the value set of the vector is {56, 58, 62, 68.5, 72.6, …, 200}, and the partition value randomly selected from the set is 72.6. It should be noted that, in any cluster division, the present disclosure generally does not suggest using the maximum or minimum of the values of the randomly selected one-dimensional vector as the partition value.
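The two random selection modes described above (a value drawn from the (min, max) range, or one of the observed values with the extremes excluded) might be sketched as follows; the function name and the from_observed flag are assumptions.

```python
import random

def pick_partition_value(samples, rng, from_observed=False):
    """Randomly pick a dimension, then a partition value: either a
    point drawn from the (min, max) range of that dimension, or one of
    the observed values with the extremes excluded."""
    dim = rng.randrange(len(samples[0]))
    values = sorted({s[dim] for s in samples})
    if from_observed:
        # The maximum and minimum could not split the cluster, so the
        # extremes are excluded (falling back if too few values exist).
        candidates = values[1:-1] or values
        return dim, rng.choice(candidates)
    return dim, rng.uniform(values[0], values[-1])

rng = random.Random(0)
samples = [[1, 100], [2, 150], [3, 200], [6, 400]]
dim, cut = pick_partition_value(samples, rng)
dim2, cut2 = pick_partition_value(samples, rng, from_observed=True)
```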
Generally, the more similar the probability distributions of the samples in the first and second sets are, the harder it is for a random cluster division, performed with a randomly selected vector and a randomly selected value, to separate the samples into new clusters that each belong to a single set; only as the number of cluster division processes increases does it become increasingly likely that all samples of a new cluster belong to the same set. The number of cluster division processes is therefore closely related to the similarity between the first set and the second set. That is to say, obtaining the partition value in a random manner during each cluster division helps improve the accuracy of measuring inter-set similarity.
In an alternative example, for any cluster to be processed, the process of performing cluster division on the samples of the cluster using the partition value in the present disclosure may be: judge, for each sample in the cluster, whether the relation between the value of the sample's corresponding dimensional vector and the partition value of this cluster division satisfies a preset magnitude relation (e.g., being greater than the partition value); divide all samples in the cluster that satisfy the preset magnitude relation into one new cluster under the cluster to be processed, and divide all samples that do not satisfy it (e.g., not greater than the partition value) into another new cluster under the cluster to be processed.
Optionally, in any cluster division processing, the result of the cluster division processing in the present disclosure may include: for a cluster to be processed, the cluster is divided into two new clusters. In one example, assume that during one cluster division a cluster to be processed contains b samples (b is an integer, e.g., b is 100), namely the first to the b-th sample; that the randomly selected vector of this cluster division is the j-th dimensional vector; and that the partition value obtained by random selection is j3. If the j-th dimensional vectors of the first b1 samples (b1 is an integer less than b) all have values less than or equal to j3, and the values of the j-th dimensional vectors of the b1+1-th to b-th samples are all greater than j3, then the first b1 samples belong to a first new cluster and the b1+1-th to b-th samples belong to a second new cluster; that is, the cluster to be processed is divided into two new clusters.
Optionally, in any cluster division processing, the result of the cluster division processing in the present disclosure may include: for a cluster to be processed, the cluster is not divided into two new clusters. For example, assume that during one cluster division a cluster to be processed contains b samples (b is an integer, e.g., b is 100), namely the first to the b-th sample; that the randomly selected vector of this cluster division is the j-th dimensional vector; and that the partition value obtained by random selection is j3. If the values of the j-th dimensional vectors of all b samples are less than or equal to j3, then all b samples belong to the same new cluster; that is, the cluster to be processed remains a single cluster.
By judging whether the relation between the value of each sample's corresponding dimensional vector and the partition value satisfies the preset magnitude relation, cluster division of the cluster to be processed can be performed rapidly while occupying few computing resources, which helps improve the efficiency of measuring inter-set similarity.
In one optional example, the current division state in the present disclosure may be represented by one or more indexes. Each of these indexes is formed based on the sets to which the samples contained in the new clusters respectively belong.
Optionally, in any cluster division processing, after comparing the values of all samples in all clusters to be processed with the current partition value, the present disclosure may obtain a plurality of new clusters. If all samples contained in one of the new clusters belong to the same set (that is, all samples of that new cluster belong to the first set, or all belong to the second set; such a cluster is referred to as a single-class cluster below), the present disclosure determines that a single-class cluster exists in the current cluster division result and performs an accumulated counting operation, thereby obtaining a first accumulated number of times.
In a specific example, assume that the initial value of the first accumulation number is 0;
First cluster division processing: all samples in the first set and the second set form the cluster to be processed, and cluster division processing is performed on it. In one example, if two new clusters are obtained this time, namely a first new cluster and a second new cluster, and both are assumed to be non-single-class clusters, the current value of the first accumulated number is recorded as 0. In another example, if two new clusters are obtained this time, the first assumed to be a single-class cluster and the second a non-single-class cluster, the current value of the first accumulated number is recorded as 1.
Second cluster division processing: in one example, if four new clusters are obtained this time, namely a first, a second, a third, and a fourth new cluster, and all four are assumed to be non-single-class clusters, the current value of the first accumulated number remains unchanged from the previous round. In another example, if four new clusters are obtained this time, the first assumed to be a single-class cluster and the second, third, and fourth non-single-class clusters, the current value of the first accumulated number is recorded as 2.
The change of the current value of the first cumulative count in subsequent cluster division processes is similar to its change in the first two cluster division processes described above and is not detailed here.
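The per-round bookkeeping for the first cumulative count can be sketched as follows; the sample representation (a set-label and value pair) and the helper names are illustrative assumptions, not part of the patent:

```python
def is_single_class(cluster):
    # A cluster is single-class when every sample carries the same set label.
    return len({label for label, _ in cluster}) == 1

def update_first_count(first_count, new_clusters):
    # Increment once per division round in which at least one
    # single-class cluster appears.
    if any(is_single_class(c) for c in new_clusters):
        return first_count + 1
    return first_count

# Round 1: samples are (set_label, value) pairs; one single-class cluster
# appears, so the count moves from 0 to 1.
round1 = [[("first", 0.2), ("first", 0.3)],        # single-class
          [("first", 0.7), ("second", 0.9)]]       # non-single-class
count = update_first_count(0, round1)              # -> 1
```

A round that produces only non-single-class clusters leaves the count unchanged, matching the first branch of the example above.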
Optionally, in any cluster division process, after comparing all samples in all clusters to be processed against the current partition value, the present disclosure may obtain a plurality of new clusters; if single-class clusters exist among all the new clusters, the present disclosure may count the cumulative number of all single-class clusters that have occurred so far, so as to obtain a second cumulative count.
In a specific example, assume that the initial value of the second cumulative count is 0.
First cluster division process: all samples in the first set and the second set are formed into a cluster to be processed, and cluster division processing is performed on it. In one example, if two new clusters are obtained this time, namely a first new cluster and a second new cluster, and both are non-single-class clusters, the current value of the second cumulative count remains 0. In another example, if the first new cluster is a single-class cluster and the second new cluster is a non-single-class cluster, the current value of the second cumulative count is recorded as 1.
Second cluster division process: all non-single-class clusters obtained in the previous round are taken as clusters to be processed, and cluster division processing is performed on each of them. In one example, if four new clusters are obtained this time, namely a first new cluster, a second new cluster, a third new cluster and a fourth new cluster, and all four are non-single-class clusters, the current value of the second cumulative count remains 0. In another example, if the first of the four new clusters is a single-class cluster and the other three are non-single-class clusters, the current value of the second cumulative count is recorded as 1. In yet another example, if one of the two new clusters obtained by the first cluster division process was a single-class cluster, the non-single-class cluster is divided this time into two new clusters, a first new cluster and a second new cluster; if the first of these is a single-class cluster and the second is a non-single-class cluster, the current value of the second cumulative count is recorded as 2.
In another example, if one of the two new clusters obtained by the first cluster division process was a single-class cluster and the two new clusters obtained by dividing the non-single-class cluster this time are both non-single-class clusters, the current value of the second cumulative count remains 1.
The change of the current value of the second cumulative count in subsequent cluster division processes is similar to its change in the first two cluster division processes exemplified above and is not detailed here.
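The second cumulative count differs from the first in that it counts single-class clusters, not rounds. A minimal sketch, with the same illustrative sample representation as assumed before:

```python
def is_single_class(cluster):
    # A cluster is single-class when every sample carries the same set label.
    return len({label for label, _ in cluster}) == 1

def update_second_count(second_count, new_clusters):
    # Add this round's single-class clusters to the running total of all
    # single-class clusters that have occurred so far.
    return second_count + sum(1 for c in new_clusters if is_single_class(c))

# Round 1 yields one single-class cluster, round 2 yields one more -> total 2.
round1 = [[("second", 0.9)], [("first", 0.2), ("second", 0.4)]]
round2 = [[("first", 0.2)], [("first", 0.3), ("second", 0.4)]]
count = update_second_count(update_second_count(0, round1), round2)   # -> 2
```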
Optionally, in any cluster division process, after comparing all samples in all clusters to be processed against the current partition value, the present disclosure may obtain a plurality of new clusters; if single-class clusters exist among all the new clusters, the present disclosure may count the number of single-class clusters that have occurred so far and obtain the single-class cluster ratio as the ratio of that number to the number of all current clusters.
In a specific example, assume that the initial value of the single-class cluster ratio is 0.
First cluster division process: all samples in the first set and the second set are formed into a cluster to be processed, and cluster division processing is performed on it. In one example, if two new clusters are obtained this time, namely a first new cluster and a second new cluster, and both are non-single-class clusters, the current value of the single-class cluster ratio remains 0. In another example, if the first new cluster is a single-class cluster and the second new cluster is a non-single-class cluster, the current value of the single-class cluster ratio is recorded as 1/2.
Second cluster division process: all non-single-class clusters obtained in the previous round are taken as clusters to be processed, and cluster division processing is performed on each of them. In one example, if four new clusters are obtained this time, namely a first new cluster, a second new cluster, a third new cluster and a fourth new cluster, and all four are non-single-class clusters, the current value of the single-class cluster ratio remains 0. In another example, if the first of the four new clusters is a single-class cluster and the other three are non-single-class clusters, the current value of the single-class cluster ratio is recorded as 1/4. In yet another example, if one of the two new clusters obtained by the first cluster division process was a single-class cluster, the non-single-class cluster is divided this time into two new clusters, a first new cluster and a second new cluster; if the first of these is a single-class cluster and the second is a non-single-class cluster, the current value of the single-class cluster ratio is recorded as 2/3.
In another example, if one of the two new clusters obtained by the first cluster division process was a single-class cluster and the two new clusters obtained by dividing the non-single-class cluster this time are both non-single-class clusters, the current value of the single-class cluster ratio is recorded as 1/3. That is, although a single-class cluster obtained in an earlier cluster division process is no longer treated as a cluster to be processed (i.e., it does not participate in subsequent cluster division), it is still counted in both the numerator and the denominator of the ratio.
The change of the current value of the single-class cluster ratio in subsequent cluster division processes is similar to that in the first two cluster division processes exemplified above, and a detailed description thereof is omitted.
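The ratio accounting above, where set-aside single-class clusters stay in both numerator and denominator, can be sketched as follows (sample representation and helper names are illustrative assumptions):

```python
def is_single_class(cluster):
    # A cluster is single-class when every sample carries the same set label.
    return len({label for label, _ in cluster}) == 1

def single_class_ratio(rounds):
    # Single-class clusters are set aside (never divided again) but remain in
    # numerator and denominator; the denominator also counts the clusters
    # still pending division.
    singles = 0
    ratio = 0.0
    for new_clusters in rounds:
        singles += sum(1 for c in new_clusters if is_single_class(c))
        pending = sum(1 for c in new_clusters if not is_single_class(c))
        ratio = singles / (singles + pending)
    return ratio

# Round 1: one of two new clusters is single-class -> ratio 1/2.
r1 = [[("second", 9.0)],
      [("first", 1.0), ("second", 1.5), ("second", 2.0)]]
# Round 2: splitting the pending cluster yields one more single -> ratio 2/3.
r2 = [[("second", 2.0)], [("first", 1.0), ("second", 1.5)]]
```

With only `r1` the ratio is 1/2; after `r2` it becomes 2/3, matching the worked example in the text.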
Optionally, in any cluster division process, after comparing all samples in all clusters to be processed against the current partition value, the present disclosure may obtain a plurality of new clusters, calculate the sample bias value of each new cluster according to the sets to which the samples contained in that new cluster belong, and calculate the mean of all sample bias values obtained this time. The maximum sample bias value of any new cluster may be 0.5, and the minimum may be a value approaching zero. One specific example is as follows:
First cluster division process: all samples in the first set and the second set are formed into a cluster to be processed, and cluster division processing is performed on it. In one example, two new clusters are obtained this time, namely a first new cluster and a second new cluster, both non-single-class; the first new cluster contains c11 samples, of which c111 belong to the first set and c112 belong to the second set, and the second new cluster contains c12 samples, of which c121 belong to the first set and c122 belong to the second set. The sample bias value of the first new cluster is min(c111, c112)/c11, or equivalently min(c111/c11, c112/c11) (the two calculation results are the same); the sample bias value of the second new cluster is min(c121, c122)/c12, or equivalently min(c121/c12, c122/c12); and the mean of the sample bias values obtained by this cluster division process is the average of these two values. In another example, two new clusters are obtained this time, the first new cluster being a single-class cluster containing c11 samples and the second new cluster being a non-single-class cluster containing c12 samples, of which c121 belong to the first set and c122 belong to the second set; the sample bias value of the first new cluster is 0, the sample bias value of the second new cluster is min(c121, c122)/c12, and the mean of the sample bias values obtained this time is the average of the two.
Second cluster division process: all non-single-class clusters obtained in the previous round are taken as clusters to be processed, and cluster division processing is performed on each of them.
In one example, the second cluster division process obtains four new clusters, namely a first new cluster, a second new cluster, a third new cluster and a fourth new cluster, all non-single-class, where the first new cluster contains c21 samples (c211 from the first set, c212 from the second set), the second new cluster contains c22 samples (c221 from the first set, c222 from the second set), the third new cluster contains c23 samples (c231 from the first set, c232 from the second set), and the fourth new cluster contains c24 samples (c241 from the first set, c242 from the second set). The sample bias values of the four new clusters are min(c211, c212)/c21, min(c221, c222)/c22, min(c231, c232)/c23 and min(c241, c242)/c24 respectively (each may equivalently be computed by first dividing the two counts by the cluster size and then taking the minimum, the two calculation results being the same), and the mean of all sample bias values obtained this time is the mean of these four values.
In another example, the second cluster division process obtains four new clusters, where the first new cluster is a single-class cluster whose c21 samples all belong to the first set or all belong to the second set, and the second, third and fourth new clusters are non-single-class clusters containing c22 samples (c221 from the first set, c222 from the second set), c23 samples (c231 from the first set, c232 from the second set) and c24 samples (c241 from the first set, c242 from the second set) respectively. The sample bias value of the first new cluster is 0, the sample bias values of the second, third and fourth new clusters are min(c221, c222)/c22, min(c231, c232)/c23 and min(c241, c242)/c24 respectively, and the mean of all sample bias values obtained this time is the mean of these four values.
In yet another example, one of the two new clusters obtained by the first cluster division process was a single-class cluster, and the second cluster division process divides the non-single-class cluster into two new clusters, where the first new cluster is a single-class cluster whose c21 samples all belong to the first set or all belong to the second set, and the second new cluster is a non-single-class cluster containing c22 samples (c221 from the first set, c222 from the second set). The sample bias value of the first new cluster is 0, the sample bias value of the second new cluster is min(c221, c222)/c22, and the mean of all sample bias values obtained this time is the mean of these two values.
In yet another example, one of the two new clusters obtained by the first cluster division process was a single-class cluster, and the second cluster division process divides the non-single-class cluster into two new clusters that are both non-single-class, where the first new cluster contains c21 samples (c211 from the first set, c212 from the second set) and the second new cluster contains c22 samples (c221 from the first set, c222 from the second set). The sample bias value of the first new cluster is min(c211, c212)/c21, the sample bias value of the second new cluster is min(c221, c222)/c22, and the mean of all sample bias values obtained this time is the mean of these two values.
The process of obtaining the mean of all sample bias values in each subsequent cluster division process is similar to that in the first two cluster division processes and is not detailed here.
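A minimal sketch of the sample bias computation, assuming the bias of a cluster is min(n_first, n_second) / n_total (equivalently, the minimum of the two fractions); this formula is a reconstruction consistent with the stated maximum of 0.5 and a minimum at or near zero, and the helper names are illustrative:

```python
def sample_bias(cluster):
    # Assumed formula: min(n_first, n_second) / n_total, which equals
    # min(n_first / n_total, n_second / n_total) -- 0.5 for a perfectly
    # mixed cluster, 0 for a single-class cluster.
    n_first = sum(1 for label, _ in cluster if label == "first")
    n_second = len(cluster) - n_first
    return min(n_first, n_second) / len(cluster)

def mean_bias(new_clusters):
    # Mean of the bias values of all clusters produced by one division round.
    return sum(sample_bias(c) for c in new_clusters) / len(new_clusters)

balanced = [("first", 0.1), ("second", 0.2)]   # bias 0.5
single = [("first", 0.3), ("first", 0.4)]      # bias 0.0
```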
Optionally, the present disclosure may use at least one of the first cumulative count, the second cumulative count, the single-class cluster ratio, and the mean of the sample bias values as the current partitioning state.
Optionally, after each execution of the cluster division processing, the present disclosure may count the executed processing to obtain a third cumulative count and use the third cumulative count as part of the current partitioning state. That is, in addition to at least one of the first cumulative count, the second cumulative count, the single-class cluster ratio, and the mean of the sample bias values, the current partitioning state may further include the third cumulative count.
In an optional example, the current partitioning state may be any one of the four metrics alone, any one of them together with the third cumulative count, any two or three of them with or without the third cumulative count, all four metrics, or all four metrics together with the third cumulative count. Accordingly, the present disclosure may determine whether the current partitioning state meets the preset division stop state in various ways, several of which are described below:
As a first example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the magnitude relationship between the first cumulative count and a first predetermined count (e.g., 1 or an integer greater than 1) does not satisfy a first preset magnitude relationship (e.g., the first cumulative count has not reached the first predetermined count); in this case the next cluster division process must be executed. When the magnitude relationship is satisfied (e.g., the first cumulative count reaches the first predetermined count), it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
As a second example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the first cumulative count has not reached the first predetermined count and the third cumulative count has not reached the maximum number of cluster divisions; in this case the next cluster division process must be executed. When the first cumulative count reaches the first predetermined count or the third cumulative count reaches the maximum number of cluster divisions, it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
As a third example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the magnitude relationship between the second cumulative count and a second predetermined count (e.g., 1 or an integer greater than 1) does not satisfy a second preset magnitude relationship (e.g., the second cumulative count has not reached the second predetermined count); in this case the next cluster division process must be executed. When the magnitude relationship is satisfied (e.g., the second cumulative count reaches the second predetermined count), it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
As a fourth example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the second cumulative count has not reached the second predetermined count and the third cumulative count has not reached the maximum number of cluster divisions; in this case the next cluster division process must be executed. When the second cumulative count reaches the second predetermined count or the third cumulative count reaches the maximum number of cluster divisions, it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
As a fifth example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the magnitude relationship between the single-class cluster ratio and a predetermined single-class cluster ratio does not satisfy a third preset magnitude relationship (e.g., the single-class cluster ratio has not reached the predetermined ratio); in this case the next cluster division process must be executed. When the magnitude relationship is satisfied (e.g., the single-class cluster ratio reaches the predetermined ratio), it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
As a sixth example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the single-class cluster ratio has not reached the predetermined ratio and the third cumulative count has not reached the maximum number of cluster divisions; in this case the next cluster division process must be executed. When the single-class cluster ratio reaches the predetermined ratio or the third cumulative count reaches the maximum number of cluster divisions, it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
As a seventh example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the magnitude relationship between the mean of the sample bias values and a predetermined bias value does not satisfy a fourth preset magnitude relationship (e.g., the mean of the sample bias values has not reached the predetermined bias value); in this case the next cluster division process must be executed. When the magnitude relationship is satisfied (e.g., the mean of the sample bias values reaches the predetermined bias value), it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
As an eighth example, the present disclosure may determine that the current partitioning state does not meet the preset division stop state when the mean of the sample bias values has not reached the predetermined bias value and the third cumulative count has not reached the maximum number of cluster divisions; in this case the next cluster division process must be executed. When the mean of the sample bias values reaches the predetermined bias value or the third cumulative count reaches the maximum number of cluster divisions, it is determined that the current partitioning state meets the preset division stop state, and the next cluster division process need not be executed.
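The stop checks in the examples above share one shape: a metric is compared against a predetermined threshold, optionally combined with a cap on the number of divisions. A minimal sketch, in which the comparison directions (rising counts and ratio, falling mean bias) and the dictionary-based state are assumptions for illustration:

```python
import operator

def should_stop(state, rules, max_divisions=None):
    # rules maps a metric name to a (comparator, threshold) pair; stopping
    # also triggers when the division count reaches the optional cap.
    if max_divisions is not None and state["divisions"] >= max_divisions:
        return True
    return any(cmp(state[name], threshold)
               for name, (cmp, threshold) in rules.items())

state = {"divisions": 3, "first_count": 0, "mean_bias": 0.4}
rules = {"first_count": (operator.ge, 1),    # stop once a single-class round occurs
         "mean_bias": (operator.le, 0.15)}   # or once clusters are pure enough
keep_dividing = not should_stop(state, rules, max_divisions=10)   # -> True
```

Any single rule firing, or the cap being reached, ends the loop, mirroring the "metric reaches threshold OR third cumulative count reaches the maximum" wording of the examples.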
The present disclosure can not only select the corresponding metric according to actual requirements, thereby flexibly controlling the precision of measuring the similarity between sets, but can also control the processing speed of measuring the similarity between sets through the maximum number of cluster divisions, which helps improve the adaptability of the technique for measuring the similarity between sets.
In an optional example, each cluster to be processed in the present disclosure is a non-single-class cluster. Therefore, after a cluster division process is performed and the clusters to be processed are updated from the resulting new clusters, single-class clusters among the new clusters should not be used as clusters to be processed; that is, the present disclosure should identify the non-single-class clusters among all new clusters obtained by the current division process and take each of them as a cluster to be processed.
Because a single-class cluster indicates that all samples within it already share a certain similarity, performing cluster division only on non-single-class clusters avoids further dividing samples that are already similar, which facilitates rapid measurement of the similarity between sets.
In an optional example, when it is determined that the current partitioning state meets the preset division stop state, the present disclosure may perform a corresponding operation on the number of cluster division processes executed so far and take the operation result as the similarity between the first set and the second set.
Since a single-class cluster becomes harder to produce the more cluster division processes have been executed, a large number of executed processes indicates that the samples in the clusters to be processed are relatively similar. The present disclosure therefore takes the reciprocal of the number of cluster division processes executed so far as the similarity between the first set and the second set, which facilitates convenient and accurate measurement of the similarity between sets.
An example is shown in fig. 2A to 2C, where each white circle represents a sample in the first set and each gray circle represents a sample in the second set.
From the distribution of samples, the samples in the first set and the samples in the second set in fig. 2A have low similarity, and a single-class cluster can likely be obtained through one cluster division process. For example, the straight line 200 in fig. 2A represents a partition value that divides all samples into two parts, i.e., two new clusters: the samples on the right side of the straight line 200 belong to one new cluster, and the samples on the left side belong to another. All samples on the right side of the straight line 200 belong to the second set, while the samples on the left side include samples from both the first set and the second set. If the subsequent cluster division processing is stopped as soon as a single-class cluster appears, the present disclosure can quickly obtain the similarity between the first set and the second set after executing cluster division processing once; here the similarity is 1. The maximum value of the similarity between two sets may be 1, and a smaller similarity value indicates that the two sets are more similar.
From the distribution of samples, the similarity between the samples in the first set and the samples in the second set in fig. 2B is higher than in fig. 2A, and a single-class cluster will likely appear after two cluster division processes. For example, the straight line 201 and the straight line 202 in fig. 2B each represent a partition value. The partition value represented by the straight line 201 divides all samples into two new clusters: the samples on its right side belong to one new cluster, the samples on its left side to another, and both sides contain samples from the first set and the second set. The partition value represented by the straight line 202 then divides the samples on the left side of the straight line 201 into two new clusters: the samples on the left side of the straight line 202 belong to one new cluster, and the samples between the straight line 202 and the straight line 201 belong to the other. Because all samples on the left side of the straight line 202 belong to the second set, a single-class cluster appears and the subsequent cluster division processing is stopped; the present disclosure thus quickly obtains the similarity between the first set and the second set after executing two cluster division processes, the similarity being 0.5.
Judging from the sample distribution, the similarity between the samples in the first set and the samples in the second set in fig. 2C is higher than that in fig. 2B, and a single-class cluster is likely to appear only after five cluster division processes. If the subsequent cluster division processing is stopped once a single-class cluster appears, the present disclosure obtains the similarity between the first set and the second set after performing five cluster division processes, for example, a similarity of 0.2.
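The iterative procedure illustrated by figs. 2A-2C can be sketched in Python as follows. This is a minimal, non-authoritative illustration of the disclosed idea: the names `measure_set_similarity`, `max_splits`, and `seed`, the tuple-based sample representation, and the choice to stop at the very first single-class cluster are all assumptions of this sketch, not details fixed by the disclosure.

```python
import random

def measure_set_similarity(set_a, set_b, max_splits=100, seed=0):
    """Sketch: repeatedly split clusters at a random value of a random
    dimension; stop when some new cluster contains samples from only one
    set, then return 1 / (number of splits performed)."""
    rng = random.Random(seed)
    # Tag each sample (a tuple of numbers) with the set it came from.
    pending = [[(v, 0) for v in set_a] + [(v, 1) for v in set_b]]
    splits = 0
    while pending and splits < max_splits:
        cluster = pending.pop(0)
        dim = rng.randrange(len(cluster[0][0]))       # random dimension
        lo = min(s[0][dim] for s in cluster)
        hi = max(s[0][dim] for s in cluster)
        threshold = rng.uniform(lo, hi)               # random partition value
        left = [s for s in cluster if s[0][dim] <= threshold]
        right = [s for s in cluster if s[0][dim] > threshold]
        splits += 1
        for new in (left, right):
            if not new:
                continue
            labels = {label for _, label in new}
            if len(labels) == 1:                      # single-class cluster
                return 1.0 / splits                   # stop condition reached
            if len(new) > 1:
                pending.append(new)                   # still mixed: keep splitting
    return 1.0 / max(splits, 1)
```

With well-separated sets, as in fig. 2A, a single-class cluster tends to appear after very few splits, yielding a similarity value close to 1 (less similar); heavily overlapping sets need many splits, yielding a small value (more similar).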
Exemplary devices
Fig. 3 is a schematic structural diagram of an embodiment of an apparatus for measuring similarity between sets according to the present disclosure. The apparatus of this embodiment may be used to implement the above method embodiments of the present disclosure. As shown in fig. 3, the apparatus of this embodiment includes: a sample vector obtaining module 300, a to-be-processed cluster determining module 301, a partition value determining module 302, a cluster division processing module 303, a state determining module 304, a loop control module 305, and a similarity determining module 306.
The sample vector obtaining module 300 is configured to obtain a sample vector of each sample in the first set and a sample vector of each sample in the second set.
The to-be-processed cluster determining module 301 is configured to use all samples in the first set and the second set acquired by the sample vector acquiring module 300 as samples in the to-be-processed cluster.
The partition value determining module 302 is configured to determine, according to a value of a one-dimensional vector of the sample vectors, a partition value for the current cluster division of the cluster to be processed.
The cluster division processing module 303 is configured to perform cluster division processing on the samples in the to-be-processed cluster obtained by the to-be-processed cluster determining module 301 by using the partition value obtained by the partition value determining module 302, so as to obtain at least one new cluster.
The state determining module 304 is configured to determine a current division state according to the set to which each sample contained in each new cluster obtained by the cluster division processing module 303 belongs.
The loop control module 305 is configured to, if the current division state obtained by the state determining module 304 does not conform to the preset division stop state, update the cluster to be processed according to the new clusters obtained by the cluster division processing module 303, and trigger the partition value determining module 302 to again determine, according to a value of a one-dimensional vector of the sample vectors, a partition value for the current cluster division of the cluster to be processed.
The similarity determining module 306 is configured to, if the current division state obtained by the state determining module 304 conforms to the preset division stop state, determine the similarity between the first set and the second set according to the number of cluster division processes that have been performed.
Optionally, the sample vector obtaining module 300 may include at least one of a first sub-module 3001 and a second sub-module 3002. The first sub-module 3001 is configured to perform dimension compression processing on at least some dimensions of each sample in the first set and the second set to obtain a multi-dimensional compressed vector of each sample; if a sample contains at least one dimensional vector that has not been compressed, the first sub-module 3001 splices the multi-dimensional compressed vector of each sample with the uncompressed dimensional vectors of that sample, thereby obtaining the sample vector of each sample in the first set and the sample vector of each sample in the second set. The second sub-module 3002 is configured to: for any enumerated-type dimension of the samples in the first set and the second set, convert that dimension of each sample into a multi-dimensional one-hot vector, and set the values of each sample's one-hot vector according to the enumeration value that the sample takes on that dimension. When the sample vector obtaining module 300 includes both the first sub-module 3001 and the second sub-module 3002, the second sub-module 3002 may first perform type conversion on the corresponding dimension of each sample, and the first sub-module 3001 then performs dimension compression.
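The one-hot conversion performed by the second sub-module can be illustrated with a short sketch. The helper name `one_hot_expand` and the tuple-based sample representation are assumptions of this illustration, not names from the disclosure:

```python
def one_hot_expand(samples, enum_dim, categories):
    """Replace the enumerated value at index `enum_dim` of each sample
    with a multi-dimensional one-hot vector, one dimension per category."""
    expanded = []
    for sample in samples:
        one_hot = [1.0 if sample[enum_dim] == c else 0.0 for c in categories]
        # Splice: dimensions before the enum dim + one-hot + dimensions after.
        expanded.append(sample[:enum_dim] + tuple(one_hot) + sample[enum_dim + 1:])
    return expanded
```

For example, with categories `["red", "green", "blue"]`, a sample `(0.5, "red")` becomes `(0.5, 1.0, 0.0, 0.0)`, so every dimension of the resulting sample vector is numeric and can be compared against a partition value.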
Optionally, the partition value determining module 302 may include: a third sub-module 3021 or a fourth sub-module 3022. The third sub-module 3021 is configured to select a one-dimensional vector from the sample vectors at random, determine the value range formed by the values of the selected one-dimensional vector, select one value from that range at random, and use the selected value as the partition value for the current cluster division of the cluster to be processed. The fourth sub-module 3022 is configured to select a one-dimensional vector from the sample vectors at random, select one value at random from all values of that one-dimensional vector, and use the selected value as the partition value for the current cluster division of the cluster to be processed.
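The two alternatives described for module 302 can be sketched as follows; this is a hedged illustration, and the function names are this sketch's own, not the disclosure's:

```python
import random

def partition_value_from_range(cluster, rng):
    """Third sub-module variant: pick a random dimension, then pick a
    random value inside that dimension's [min, max] value range."""
    dim = rng.randrange(len(cluster[0]))
    lo = min(sample[dim] for sample in cluster)
    hi = max(sample[dim] for sample in cluster)
    return dim, rng.uniform(lo, hi)

def partition_value_from_samples(cluster, rng):
    """Fourth sub-module variant: pick a random dimension, then pick the
    value one randomly chosen sample takes on that dimension."""
    dim = rng.randrange(len(cluster[0]))
    return dim, rng.choice([sample[dim] for sample in cluster])
```

The first variant can produce a partition value no sample actually takes; the second always returns an observed value, which guarantees at least one sample falls on the "less than or equal" side of the split.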
Optionally, the cluster division processing module 303 may be further configured to, for any sample in the cluster to be processed, divide the sample into one new cluster under the cluster to be processed if the size relation between the sample's value on the corresponding dimension and the partition value satisfies a preset size relation, and otherwise divide the sample into another new cluster under the cluster to be processed.
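Module 303's threshold comparison can be sketched as below. The sketch assumes the "less than or equal" size relation; the disclosure leaves the concrete preset relation open, so this is one illustrative choice:

```python
def split_cluster(cluster, dim, partition_value):
    """Divide a cluster into two new clusters by comparing each sample's
    value on dimension `dim` against the partition value."""
    new_a = [s for s in cluster if s[dim] <= partition_value]
    new_b = [s for s in cluster if s[dim] > partition_value]
    # Per the disclosure, at least one new cluster results; an empty side
    # is simply dropped in this sketch.
    return [c for c in (new_a, new_b) if c]
```

If the partition value lies outside the cluster's value range on the chosen dimension, all samples land on one side and the result is a single new cluster, which matches the "at least one new cluster" phrasing above.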
Optionally, the state determining module 304 includes at least one of: a fifth sub-module 3041, a sixth sub-module 3042, a seventh sub-module 3043, and an eighth sub-module 3044.
The fifth sub-module 3041 is configured to, when all samples contained in one of the new clusters belong to the same set, determine that a single-class cluster exists in the cluster division processing result and perform a cumulative counting operation for the single-class cluster, obtaining a first cumulative count.
The sixth sub-module 3042 is configured to, when all samples contained in one of the new clusters belong to the same set, determine that single-class clusters exist in the cluster division processing result and perform a cumulative counting operation on the number of single-class clusters, obtaining a second cumulative count.
The seventh sub-module 3043 is configured to, when all samples contained in one of the new clusters belong to the same set, determine that single-class clusters exist in the cluster division processing result and obtain the single-class cluster proportion as the ratio of the number of single-class clusters to the number of new clusters.
The eighth sub-module 3044 is configured to determine a sample bias value of each new cluster according to the set to which all samples included in each new cluster belong.
At least one of the first cumulative count, the second cumulative count, the single-class cluster proportion, and the sample bias value is used to form the current division state.
Optionally, the state determining module 304 may further include a ninth sub-module 3045 configured to count the cluster division processes that have been performed to obtain a third cumulative count; the current division state then includes the third cumulative count.
Optionally, the current division state does not conform to the preset division stop state when at least one of the following holds: the size relation between the first cumulative count and a first preset cumulative count does not conform to a first preset size relation, and the third cumulative count has not reached the preset maximum number of cluster divisions; the size relation between the second cumulative count and a second preset cumulative count does not conform to a second preset size relation, and the third cumulative count has not reached the preset maximum number of cluster divisions; the size relation between the single-class cluster proportion and a preset single-class cluster proportion does not conform to a third preset size relation, and the third cumulative count has not reached the preset maximum number of cluster divisions; the size relation between the mean of the sample bias values and a preset bias value does not conform to a fourth preset size relation, and the third cumulative count has not reached the preset maximum number of cluster divisions.
Optionally, when updating the cluster to be processed, the loop control module 305 may determine the non-single clusters among all new clusters obtained by the current cluster division processing and use each non-single cluster as a cluster to be processed. A non-single cluster is a new cluster that contains both samples from the first set and samples from the second set.
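The update performed by the loop control module can be sketched as below. The sketch assumes samples are carried as `(vector, set_label)` pairs with label 0 for the first set and 1 for the second set; this representation is an assumption of the illustration, not part of the disclosure:

```python
def update_pending(new_clusters):
    """Keep only the non-single clusters, i.e., new clusters that still
    mix samples from the first set (label 0) and the second set (label 1)."""
    return [c for c in new_clusters
            if len({label for _, label in c}) > 1]
```

Clusters whose samples all carry the same label are single-class clusters and are not divided further.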
Optionally, the similarity determining module 306 may be further configured to take the reciprocal of the number of cluster division processes that have been performed as the similarity between the first set and the second set.
The operations specifically performed by the above modules and their sub-modules are described in the method embodiments with reference to figs. 1-2 and are not repeated here.
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 4. FIG. 4 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 4, the electronic device 41 includes one or more processors 411 and memory 412.
The processor 411 may be a Central Processing Unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 41 to perform desired functions.
Memory 412 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), hard disks, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 411 to implement the methods of measuring inter-set similarity of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.
In one example, the electronic device 41 may further include an input device 413 and an output device 414, interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 413 may include, for example, a keyboard, a mouse, and the like. The output device 414 may output various information to the outside and may include, for example, a display, speakers, a printer, and a communication network and the remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 41 relevant to the present disclosure are shown in fig. 4, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 41 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in a method of measuring inter-set similarity according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer program product may carry program code for performing operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of measuring inter-set similarity according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to" and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (13)

1. A method of measuring inter-set similarity, comprising:
obtaining sample vectors of all samples in the first set and the second set;
taking all samples in the first set and the second set as samples in a cluster to be processed;
determining, according to a value of a one-dimensional vector of the sample vectors, a partition value for the current cluster division of the cluster to be processed;
performing cluster division processing on the samples in the cluster to be processed by using the partition value to obtain at least one new cluster;
determining a current division state according to the sets to which the samples contained in the new clusters belong;
if the current division state does not conform to a preset division stop state, updating the cluster to be processed according to the new clusters, and returning to the operation of determining, according to a value of a one-dimensional vector of the sample vectors, a partition value for the current cluster division of the cluster to be processed;
and if the current division state conforms to the preset division stop state, determining a similarity between the first set and the second set according to the number of cluster division processes that have been performed.
2. The method of claim 1, wherein said obtaining a sample vector for each sample in the first set and the second set comprises:
performing dimension compression processing on at least part of dimension vectors of each sample in the first set and the second set to obtain a multi-dimension compressed vector of each sample;
and if at least one-dimensional vectors which are not subjected to compression processing exist in the samples, splicing the multi-dimensional compressed vectors of the samples with the dimensional vectors which are not subjected to compression processing in the samples respectively to obtain sample vectors of the samples in the first set and sample vectors of the samples in the second set.
3. The method of claim 1, wherein said obtaining a sample vector for each sample in the first set and the second set comprises:
for any one dimension enumeration type vector in the samples in the first set and the second set, converting the dimension enumeration type vector in each sample in the first set and the second set into a multi-dimension one-hot type vector respectively, and setting a numerical value of the multi-dimension one-hot type vector of each sample according to an enumeration value of the dimension enumeration type vector in each sample in the first set and the second set.
4. The method according to claim 1, wherein the determining, according to a value of a one-dimensional vector of the sample vectors, a partition value for the current cluster division of the cluster to be processed comprises:
selecting a one-dimensional vector from the sample vectors at random, determining a value range formed by the values of the selected one-dimensional vector, selecting one value from the value range at random, and using the selected value as the partition value for the current cluster division of the cluster to be processed; or
selecting a one-dimensional vector from the sample vectors at random, selecting one value at random from all values of the selected one-dimensional vector, and using the selected value as the partition value for the current cluster division of the cluster to be processed.
5. The method of claim 1, wherein the performing cluster division processing on the samples in the cluster to be processed by using the partition value to obtain at least one new cluster comprises:
for any sample in the cluster to be processed, if the size relation between the sample's value on the corresponding dimension and the partition value satisfies a preset size relation, dividing the sample into one new cluster under the cluster to be processed; otherwise, dividing the sample into another new cluster under the cluster to be processed.
6. The method according to claim 1, wherein the determining the current division state according to the sets to which the samples contained in the new clusters belong comprises at least one of:
in a case where all samples contained in one of the new clusters belong to the same set, determining that a single-class cluster exists in the cluster division processing result, and performing a cumulative counting operation for the single-class cluster to obtain a first cumulative count;
in a case where all samples contained in one of the new clusters belong to the same set, determining that single-class clusters exist in the cluster division processing result, and performing a cumulative counting operation on the number of the single-class clusters to obtain a second cumulative count;
in a case where all samples contained in one of the new clusters belong to the same set, determining that single-class clusters exist in the cluster division processing result, and obtaining a single-class cluster proportion according to the ratio of the number of the single-class clusters to the number of the new clusters; and
determining a sample bias value of each new cluster according to the sets to which the samples contained in that new cluster belong;
wherein at least one of the first cumulative count, the second cumulative count, the single-class cluster proportion, and the sample bias value is used to form the current division state.
7. The method according to claim 6, wherein the determining the current division state according to the sets to which the samples contained in the new clusters belong further comprises:
counting the cluster division processes that have been performed to obtain a third cumulative count;
wherein the current division state comprises the third cumulative count.
8. The method of claim 7, wherein the current division state does not conform to the preset division stop state when at least one of the following holds:
the size relation between the first cumulative count and a first preset cumulative count does not conform to a first preset size relation, and the third cumulative count has not reached a preset maximum number of cluster divisions;
the size relation between the second cumulative count and a second preset cumulative count does not conform to a second preset size relation, and the third cumulative count has not reached the preset maximum number of cluster divisions;
the size relation between the single-class cluster proportion and a preset single-class cluster proportion does not conform to a third preset size relation, and the third cumulative count has not reached the preset maximum number of cluster divisions; and
the size relation between the mean of the sample bias values and a preset bias value does not conform to a fourth preset size relation, and the third cumulative count has not reached the preset maximum number of cluster divisions.
9. The method according to any one of claims 1 to 8, wherein the updating the cluster to be processed according to each new cluster comprises:
determining non-single clusters in all new clusters obtained by executing cluster division processing this time, and respectively taking all the non-single clusters as clusters to be processed;
wherein the non-single cluster is a new cluster containing both the samples in the first set and the samples in the second set.
10. The method of claim 1, wherein said determining a similarity of said first and second sets based on a number of cluster division processes that have been currently performed comprises:
and taking the reciprocal of the number of times of the cluster division processing which has been currently performed as the similarity of the first set and the second set.
11. An apparatus for measuring inter-set similarity, wherein the apparatus comprises:
a sample vector obtaining module, configured to obtain a sample vector of each sample in the first set and a sample vector of each sample in the second set;
a to-be-processed cluster determining module, configured to take all samples in the first set and the second set as samples in a cluster to be processed;
a partition value determining module, configured to determine, according to a value of a one-dimensional vector of the sample vectors, a partition value for the current cluster division of the cluster to be processed;
a cluster division processing module, configured to perform cluster division processing on the samples in the cluster to be processed by using the partition value, to obtain at least one new cluster;
a state determining module, configured to determine a current division state according to the sets to which the samples contained in the new clusters belong;
a loop control module, configured to, if the current division state does not conform to a preset division stop state, update the cluster to be processed according to the new clusters, and trigger the partition value determining module to perform the operation of determining, according to a value of a one-dimensional vector of the sample vectors, a partition value for the current cluster division of the cluster to be processed; and
a similarity determining module, configured to, if the current division state conforms to the preset division stop state, determine a similarity between the first set and the second set according to the number of cluster division processes that have been performed.
12. A computer-readable storage medium, wherein the storage medium stores a computer program for performing the method of any one of claims 1-10.
13. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-10.
CN202110138928.6A 2021-02-02 2021-02-02 Method, device, medium and electronic equipment for measuring similarity between sets Active CN112508134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110138928.6A CN112508134B (en) 2021-02-02 2021-02-02 Method, device, medium and electronic equipment for measuring similarity between sets

Publications (2)

Publication Number Publication Date
CN112508134A true CN112508134A (en) 2021-03-16
CN112508134B CN112508134B (en) 2021-06-04

Family

ID=74952625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110138928.6A Active CN112508134B (en) 2021-02-02 2021-02-02 Method, device, medium and electronic equipment for measuring similarity between sets

Country Status (1)

Country Link
CN (1) CN112508134B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358434A1 (en) * 2016-06-09 2017-12-14 Shimadzu Corporation Big-data analyzing method and mass spectrometric system using the same method
CN108763420A (en) * 2018-05-24 2018-11-06 广州视源电子科技股份有限公司 Sorting technique, device, terminal and the computer readable storage medium of data object
CN108764726A (en) * 2018-05-29 2018-11-06 阿里巴巴集团控股有限公司 The method and device of decision is carried out to request according to rule
CN108764319A (en) * 2018-05-21 2018-11-06 北京京东尚科信息技术有限公司 A kind of sample classification method and apparatus
CN109101620A (en) * 2018-08-08 2018-12-28 广州神马移动信息科技有限公司 Similarity calculating method, clustering method, device, storage medium and electronic equipment
CN109960763A (en) * 2019-03-21 2019-07-02 湖南大学 A kind of photography community personalization friend recommendation method based on user's fine granularity photography preference
CN110633737A (en) * 2019-08-29 2019-12-31 广州视源电子科技股份有限公司 Inter-cluster similarity determination method, device, equipment and storage medium
CN110909817A (en) * 2019-11-29 2020-03-24 深圳市商汤科技有限公司 Distributed clustering method and system, processor, electronic device and storage medium
CN111222916A (en) * 2020-01-03 2020-06-02 腾讯科技(深圳)有限公司 Method and device for determining delivery area, model training method and storage medium
CN111310834A (en) * 2020-02-19 2020-06-19 深圳市商汤科技有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111950632A (en) * 2020-08-12 2020-11-17 贝壳技术有限公司 Method, apparatus, medium, and electronic device for implementing clustering

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358434A1 (en) * 2016-06-09 2017-12-14 Shimadzu Corporation Big-data analyzing method and mass spectrometric system using the same method
CN108764319A (en) * 2018-05-21 2018-11-06 北京京东尚科信息技术有限公司 Sample classification method and apparatus
CN108763420A (en) * 2018-05-24 2018-11-06 广州视源电子科技股份有限公司 Data object classification method, device, terminal and computer-readable storage medium
CN108764726A (en) * 2018-05-29 2018-11-06 阿里巴巴集团控股有限公司 Method and device for making decisions on requests according to rules
CN109101620A (en) * 2018-08-08 2018-12-28 广州神马移动信息科技有限公司 Similarity calculating method, clustering method, device, storage medium and electronic equipment
CN109960763A (en) * 2019-03-21 2019-07-02 湖南大学 Personalized friend recommendation method for photography communities based on users' fine-grained photography preferences
CN110633737A (en) * 2019-08-29 2019-12-31 广州视源电子科技股份有限公司 Inter-cluster similarity determination method, device, equipment and storage medium
CN110909817A (en) * 2019-11-29 2020-03-24 深圳市商汤科技有限公司 Distributed clustering method and system, processor, electronic device and storage medium
CN111222916A (en) * 2020-01-03 2020-06-02 腾讯科技(深圳)有限公司 Method and device for determining delivery area, model training method and storage medium
CN111310834A (en) * 2020-02-19 2020-06-19 深圳市商汤科技有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111950632A (en) * 2020-08-12 2020-11-17 贝壳技术有限公司 Method, apparatus, medium, and electronic device for implementing clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DYLLIAN: "Collaborative filtering recommendation and similarity measures", 《HTTPS://MY.OSCHINA.NET/DILLAN/BLOG/164263》 *
白雪 (Bai Xue): "Research on similarity measures and their applications in cluster analysis", 《China Doctoral Dissertations Full-text Database, Information Science and Technology》 *

Also Published As

Publication number Publication date
CN112508134B (en) 2021-06-04

Similar Documents

Publication Publication Date Title
US11455322B2 (en) Classification of time series data
CN110825894A (en) Method, device and equipment for establishing and retrieving data indexes, and storage medium
CN109978006B (en) Face image clustering method and device
CN110750615A (en) Text repeatability judgment method and device, electronic equipment and storage medium
US11714834B2 (en) Data compression based on co-clustering of multiple parameters for AI training
CN109241511B (en) Electronic report generation method and equipment
CN115294328A (en) Target detection frame generation method and device, storage medium and electronic equipment
CN112508134B (en) Method, device, medium and electronic equipment for measuring similarity between sets
CN113114540B (en) Method for setting bandwidth predictor and adjusting service and related device
US20240176657A1 (en) Task processing method and apparatus, electronic device, storage medium and program product
CN113010525A (en) Ocean space-time big data parallel KNN query processing method based on PID
CN111626898A (en) Method, device, medium and electronic equipment for realizing attribution of events
WO2021118755A1 (en) Machine-learning for combinatorial decisioning
CN112465012A (en) Machine learning modeling method and device, electronic equipment and readable storage medium
CN113946566B (en) Web system fingerprint database construction method and device and electronic equipment
CN107368281B (en) Data processing method and device
CN114547384A (en) Resource object processing method and device and computer equipment
CN114418114A (en) Operator fusion method and device, terminal equipment and storage medium
CN113326064A (en) Method for dividing business logic module, electronic equipment and storage medium
CN113971224A (en) Image retrieval system, method and related equipment
CN110895590A (en) Candidate object acquisition method and device, electronic equipment and storage medium
CN111047415A (en) Clothing accessory order processing method, system, electronic equipment and storage medium
Iwata et al. Using an artificial neural network for predicting embedded software development effort
CN114547285B (en) Method and device for inferring meaning of table data, computer device and storage medium
CN110189234B (en) Hotel information adjusting method and device for OTA platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant