CN117332303B - Label correction method for clusters - Google Patents

Label correction method for clusters

Info

Publication number
CN117332303B
Authority
CN
China
Prior art keywords
cluster
sample
meta
feature
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311630041.4A
Other languages
Chinese (zh)
Other versions
CN117332303A (en)
Inventor
祁纲
王语博
韩国权
李芳�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiji Computer Corp Ltd
Original Assignee
Taiji Computer Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiji Computer Corp Ltd
Priority to CN202311630041.4A
Publication of CN117332303A
Application granted
Publication of CN117332303B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a label correction method for clusters, belonging to the technical field of label correction, which comprises: performing unsupervised feature selection on each main cluster to obtain a first multi-dimensional label for each main cluster; creating meta-clusters, obtaining the element clusters in each meta-cluster, and performing dimension reduction on the first multi-dimensional labels to obtain a second multi-dimensional label for each meta-cluster; obtaining the sample feature information of the samples to be allocated and the similarity between the sample feature information of each sample and each element cluster, and allocating each sample to the element cluster with the highest similarity to obtain final clusters; and obtaining the final cluster feature information and correcting the corresponding second multi-dimensional label. The first multi-dimensional label is obtained by unsupervised feature selection, the second multi-dimensional label is obtained by re-clustering followed by dimension reduction, and the samples are then allocated by similarity, so that each sample is allocated to the element cluster with the highest similarity and the accuracy of the labels is ensured.

Description

Label correction method for clusters
Technical Field
The invention relates to the technical field of label correction, in particular to a label correction method for a cluster.
Background
At present, labels are corrected mainly by hand and automatic label correction is rarely used, which wastes labor and cannot guarantee label accuracy. When data are collected from clusters without known features, labels can only be chosen in a general direction and the corresponding multi-dimensional labels determined; the labels must later be screened for the specific application scenario, which requires dimension reduction. Dimension reduction easily introduces errors, so it is necessary to judge whether the final labels meet the requirements and to correct erroneous labels.
Accordingly, the present invention provides a label correction method for clusters.
Disclosure of Invention
The invention provides a label correction method for clusters. Unsupervised feature selection is performed on the main clusters in the initial state to obtain a label for each main cluster; main clusters are selected to create meta-clusters, from which element clusters are obtained; samples are allocated by similarity to obtain final clusters; and the labels are corrected, so that each sample ends up in the most suitable final cluster and the accuracy of the labels is ensured.
The invention provides a label correction method for a cluster, which comprises the following steps:
step 1: acquiring the number of main clusters in the initial state, performing unsupervised feature selection on each main cluster, and obtaining a first multi-dimensional label for each main cluster based on the selection result;
step 2: selecting, based on the feature information of the cluster scene, the main clusters that need to be re-clustered to create a meta-cluster, acquiring the element clusters in each meta-cluster, and performing dimension reduction on the first multi-dimensional label based on the feature information of the element clusters in each meta-cluster to obtain a second multi-dimensional label for each meta-cluster;
step 3: obtaining the sample feature information of the samples to be allocated and the similarity between the sample feature information of each sample to be allocated and each element cluster, and allocating each sample to the element cluster with the highest similarity to obtain final clusters;
step 4: acquiring the final cluster feature information of all final clusters allocated within each meta-cluster, and correcting the corresponding second multi-dimensional label based on the final cluster feature information.
In one possible implementation manner, the process of obtaining the number of the main clusters in the initial state and performing unsupervised selection of the features of each main cluster includes:
acquiring a first data set in an initial state, and classifying each data in the first data set based on a preset group type;
and taking the number of the classification results as the number of the main clusters, and determining the multidimensional characteristic of the corresponding main clusters by combining the data characteristic of each data in each classification result.
In one possible implementation manner, performing unsupervised feature selection on each main cluster, and obtaining a first multi-dimensional label corresponding to each main cluster based on a selection result, where the method includes:
constructing a feature selection model based on an unsupervised feature technology, and inputting the multidimensional feature and corresponding data of each main cluster into the feature selection model to obtain the information quantity of each feature;
ranking the features in the same main cluster by information quantity, and selecting the top N1 features in the ranking as the selection result;
and obtaining the feature types of all the features in the selection result of each main cluster, and determining the first multi-dimensional label of the corresponding main cluster based on the feature types.
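The ranking and label-derivation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the per-feature information scores and the feature-type mapping are all hypothetical inputs, and the feature selection model that would produce the scores is not shown.

```python
from typing import Dict, List


def select_top_features(info_by_feature: Dict[str, float], n1: int) -> List[str]:
    """Rank the features of one main cluster by information quantity, keep the top n1."""
    ranked = sorted(info_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n1]]


def first_label(selected: List[str], feature_type: Dict[str, str]) -> List[str]:
    """Build a first multi-dimensional label: one dimension per distinct feature type."""
    label: List[str] = []
    for name in selected:
        ftype = feature_type[name]
        if ftype not in label:  # each feature type contributes one label dimension
            label.append(ftype)
    return label


# Hypothetical information quantities and feature types for one main cluster.
info = {"f1": 0.9, "f2": 0.1, "f3": 0.5, "f4": 0.7}
types = {"f1": "color", "f2": "size", "f3": "shape", "f4": "color"}

selected = select_top_features(info, n1=3)
label = first_label(selected, types)
print(selected, label)
```

Here N1 is passed as `n1`; how N1 is chosen is not stated in the text, so it is left as a parameter.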
In one possible implementation manner, selecting a main cluster to be re-clustered to create a meta-cluster based on feature information of a cluster scene, and obtaining a meta-cluster in each meta-cluster includes:
acquiring the cluster scene to be applied and the feature information of that cluster scene; among all main clusters, taking those whose feature information has a similarity to the feature information of the cluster scene exceeding a preset similarity as the main clusters that need to be re-clustered; and re-clustering them to obtain a meta-cluster;
and performing ensemble clustering on the element clusters, and obtaining the number of element clusters and the feature information of each element cluster based on the clustering result.
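The selection of main clusters for re-clustering can be sketched as below. The text does not say how the similarity between a main cluster and the cluster scene is measured, so this sketch assumes a Jaccard similarity over sets of feature types; the names `jaccard`, `clusters_to_recluster` and the threshold value are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two feature-type sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clusters_to_recluster(main_clusters: Dict[str, Set[str]],
                          scene: Set[str],
                          min_similarity: float) -> List[str]:
    """Return the main clusters whose similarity to the scene exceeds the preset value."""
    return [name for name, feats in main_clusters.items()
            if jaccard(feats, scene) > min_similarity]


main_clusters = {
    "main_1": {"t1", "t2", "t3", "t4"},
    "main_2": {"t1", "t2"},
    "main_3": {"t5", "t6"},
}
scene = {"t1", "t2"}  # feature types required by the application scenario

chosen = clusters_to_recluster(main_clusters, scene, min_similarity=0.4)
print(chosen)
```

The chosen main clusters would then be pooled and re-clustered; the ensemble clustering step itself is not shown here.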
In one possible implementation manner, performing dimension reduction processing on the first multidimensional label based on feature information corresponding to a meta-cluster in each meta-cluster to obtain a second multidimensional label corresponding to each meta-cluster, including:
acquiring the main clusters associated with each meta-cluster, and the difference between the number of feature types contained in each main cluster and the number of feature types of the element clusters in the corresponding meta-cluster, and performing dimension reduction on the label of the corresponding main cluster based on this number difference to obtain a first dimension-reduction label for that main cluster;
and obtaining the repetition of the feature types contained in all main clusters associated with each meta-cluster, performing repeated screening on the first dimension-reduction labels based on this repetition, and obtaining the second multi-dimensional label of each meta-cluster based on the screening result.
In one possible implementation manner, obtaining sample feature information of samples to be allocated and similarity between the sample feature information of each sample to be allocated and each meta cluster, and allocating each sample to be allocated to a meta cluster with highest similarity to obtain a final cluster, where the method includes:
acquiring a second data set of a main cluster needing to be re-clustered, determining samples to be allocated based on the second data set, and extracting features of the corresponding samples to be allocated based on current cluster information of each sample to be allocated to obtain sample feature information of each sample to be allocated;
constructing sample feature vectors based on sample feature information of the samples to be distributed, calculating sample vector similarity between every two samples, and classifying the samples to be distributed based on the vector similarity between all the samples to obtain samples to be distributed in the same class;
(the similarity formula is not preserved in this text); wherein the quantities in the formula are, in order:
the vector similarity between the i-th sample and the j-th sample;
the parameter-description intersection ratio between the i-th sample feature vector and the j-th sample feature vector;
the element set of the vector corresponding to the i-th sample;
the element set of the vector corresponding to the j-th sample, each element set comprising the parameter descriptions and the parameter value of each parameter description;
the number of intersection elements of the i-th sample and the j-th sample based on the parameter description;
the number of union elements of the i-th sample and the j-th sample based on the parameter description;
the number of intersection elements of the i-th sample and the j-th sample based on the parameter description and the parameter value;
the number of union elements of the i-th sample and the j-th sample based on the parameter description and the parameter value;
the element difference variance between the vectors of the i-th sample and the j-th sample;
the average of the element difference variances between the vectors of every pair of samples to be allocated;
the distance value between the vectors of the i-th sample and the j-th sample;
a similarity adjustment factor based on the intersection result;
max, the maximum-value operator;
the number of elements of the i-th sample;
the number of elements of the j-th sample;
the parameter-description-and-parameter-value intersection ratio between the i-th sample feature vector and the j-th sample feature vector;
obtaining a center vector of each sample to be allocated in the same category, constructing a meta-cluster feature vector of each meta-cluster based on feature information of each meta-cluster, and calculating similarity between the center vector of each sample to be allocated in the same category and the feature vector of each meta-cluster to obtain vector similarity;
screening out the categories of samples to be allocated whose vector similarity exceeds a preset similarity threshold, together with the corresponding element clusters; if the samples to be allocated in one category correspond to a single element cluster, allocating those samples to that element cluster to obtain a final cluster;
if the samples to be allocated in one category correspond to two or more element clusters, selecting the element cluster with the highest vector similarity and allocating the samples of that category to it to obtain a final cluster;
if the similarity between the center vector of the samples to be allocated in one category and the feature vector of every element cluster is below the preset similarity threshold, calculating the similarity between the sample feature vector of each sample in that category and the feature vector of each element cluster, and allocating each sample to the element cluster with the highest similarity based on the calculation result to obtain a final cluster.
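The three allocation cases above can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity stands in for the unspecified similarity between a category center and an element cluster feature vector, the category center is taken as the mean vector, and the names `allocate`, `cosine` and the threshold of 0.9 are hypothetical.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity (an assumed stand-in for the similarity in the text)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def allocate(categories: dict, cluster_vectors: dict, threshold: float) -> dict:
    """Map (category, sample index) to the name of the chosen element cluster."""
    assignment = {}
    for cat, samples in categories.items():
        center = samples.mean(axis=0)  # center vector of the category
        sims = {c: cosine(center, v) for c, v in cluster_vectors.items()}
        best = max(sims, key=sims.get)
        if sims[best] > threshold:
            # One or several element clusters pass the threshold:
            # the whole category goes to the most similar one.
            for i in range(len(samples)):
                assignment[(cat, i)] = best
        else:
            # No element cluster passes: fall back to per-sample allocation.
            for i, s in enumerate(samples):
                per = {c: cosine(s, v) for c, v in cluster_vectors.items()}
                assignment[(cat, i)] = max(per, key=per.get)
    return assignment


cluster_vectors = {"A": np.array([1.0, 0.0]), "B": np.array([0.0, 1.0])}
categories = {
    "c1": np.array([[1.0, 0.1], [0.9, 0.0]]),  # clearly close to A
    "c2": np.array([[1.0, 0.0], [0.0, 1.0]]),  # center is ambiguous
}
result = allocate(categories, cluster_vectors, threshold=0.9)
print(result)
```

Category `c1` is allocated as a whole, while the center of `c2` is not similar enough to any element cluster, so its samples are allocated one by one.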
In one possible implementation manner, obtaining the final cluster feature information of all final clusters allocated corresponding to each meta-cluster includes:
sample characteristic information of a corresponding sample in each final cluster is obtained, and the final cluster characteristic information corresponding to each final cluster is determined;
and acquiring all final clusters in each meta-cluster, and performing information arrangement based on the final cluster characteristic information of the corresponding final clusters to obtain the final cluster characteristic information corresponding to all final clusters in each meta-cluster.
In one possible implementation, correcting the corresponding second multi-dimensional label based on the final cluster feature information includes:
acquiring a corresponding second multi-dimensional label of each metacluster, and determining the label characteristic of each metacluster based on the second multi-dimensional label;
and determining the corresponding relation between the label characteristics of each meta-cluster and the final cluster characteristics reflected in the final cluster characteristic information of the corresponding meta-cluster, and correcting the second multi-dimensional label based on the corresponding relation.
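The text does not spell out the correction rule, so the following is only one possible reading, labeled as an assumption: label dimensions with no counterpart among the features observed in the final clusters are dropped, and observed feature types missing from the label are appended. The function name `correct_label` and the data are hypothetical.

```python
from typing import List, Set


def correct_label(second_label: List[str],
                  final_cluster_features: List[Set[str]]) -> List[str]:
    """Align a second multi-dimensional label with the features seen in the final clusters.

    Assumed rule (not stated in the source): keep label dimensions that are
    observed in at least one final cluster, then append observed feature
    types that the label lacks.
    """
    observed: Set[str] = set().union(*final_cluster_features)
    kept = [t for t in second_label if t in observed]
    added = sorted(observed - set(second_label))
    return kept + added


second_label = ["t3", "t4"]
final_cluster_features = [{"t3", "t5"}, {"t3"}]  # features of each final cluster

corrected = correct_label(second_label, final_cluster_features)
print(corrected)
```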
Compared with the prior art, the beneficial effects of this application are: the first multi-dimensional label corresponding to each main cluster is obtained by carrying out unsupervised feature selection on each main cluster, then the main cluster is selected by a cluster scene to carry out re-clustering to obtain a meta-cluster, the second multi-dimensional label is obtained by carrying out dimension reduction treatment, samples are distributed into proper meta-clusters through similarity to obtain final clusters, and then label correction is carried out, so that each sample can be accurately positioned in the most proper final cluster, and the labels of the corresponding meta-clusters can accurately reflect the features of all samples in all final clusters.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
Fig. 1 is a flowchart of a label correction method for clusters according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides a label correction method for a cluster, which comprises the following steps:
step 1: acquiring the number of main clusters in the initial state, performing unsupervised feature selection on each main cluster, and obtaining a first multi-dimensional label for each main cluster based on the selection result;
step 2: selecting, based on the feature information of the cluster scene, the main clusters that need to be re-clustered to create a meta-cluster, acquiring the element clusters in each meta-cluster, and performing dimension reduction on the first multi-dimensional label based on the feature information of the element clusters in each meta-cluster to obtain a second multi-dimensional label for each meta-cluster;
step 3: obtaining the sample feature information of the samples to be allocated and the similarity between the sample feature information of each sample to be allocated and each element cluster, and allocating each sample to the element cluster with the highest similarity to obtain final clusters;
step 4: acquiring the final cluster feature information of all final clusters allocated within each meta-cluster, and correcting the corresponding second multi-dimensional label based on the final cluster feature information.
In this embodiment, the main clusters are the classification of all data in the initial state; for example, if the data in the initial state fall into 5 groups, there are 5 main clusters in the initial state.
In this embodiment, unsupervised feature selection combines four unsupervised feature selection techniques, namely low-variance removal, multi-cluster feature selection (MCFS), spectral feature selection (SFS) and the Laplacian score, to remove the features that do not contribute significantly to clustering and to select the features with the highest information content for clustering.
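Two of the four techniques named above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the patent's model: MCFS and SFS are omitted, the Laplacian score uses a dense Gaussian affinity over all sample pairs instead of a nearest-neighbor graph, and the parameters `min_var` and `sigma` are hypothetical. A lower Laplacian score indicates a feature that better preserves local structure.

```python
import numpy as np


def low_variance_mask(X: np.ndarray, min_var: float) -> np.ndarray:
    """True for the columns whose variance exceeds min_var."""
    return X.var(axis=0) > min_var


def laplacian_scores(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Laplacian score per feature, using a dense Gaussian affinity matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    S = np.exp(-sq / (2 * sigma ** 2))                       # affinity matrix
    d = S.sum(axis=1)                                        # degree of each sample
    L = np.diag(d) - S                                       # graph Laplacian
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ d) / d.sum()                          # remove degree-weighted mean
        denom = f_t @ (d * f_t)
        scores.append((f_t @ L @ f_t) / denom if denom > 0 else np.inf)
    return np.array(scores)


# Column 0 separates two groups, column 1 varies within groups, column 2 is constant.
X = np.array([
    [0.0, 1.0, 3.0], [0.1, -1.0, 3.0], [0.2, 0.5, 3.0],
    [5.0, -0.5, 3.0], [5.1, 1.0, 3.0], [5.2, -1.0, 3.0],
])
mask = low_variance_mask(X, min_var=1e-8)
scores = laplacian_scores(X[:, mask])
print(mask, scores)
```

The constant column is removed by the variance filter, and the group-separating column receives the lower (better) Laplacian score of the two that remain.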
In this embodiment, the first multi-dimensional tag refers to a multi-dimensional tag of the main cluster in the initial state, where different dimensions of the tag represent feature information of different feature types existing in the main cluster.
In this embodiment, a meta-cluster is obtained by re-clustering main clusters according to the application scenario. For example, the application scenario of the meta-cluster requires feature type 1 and feature type 2, but main cluster 1, main cluster 2 and main cluster 3 each contain feature types 1, 2, 3 and 4; main clusters 1, 2 and 3 therefore need to be re-clustered according to the requirements of the application scenario, and the data of feature types 3 and 4 are discarded.
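The example can be made concrete with a short sketch of the step that precedes re-clustering: pooling the records of the selected main clusters and discarding the feature types the scenario does not need. The record layout and the name `project_to_scene` are hypothetical, and the re-clustering itself is not shown.

```python
from typing import Dict, List, Set


def project_to_scene(main_clusters: Dict[str, List[dict]],
                     scene_types: Set[str]) -> List[dict]:
    """Pool the records of the main clusters, keeping only scene-required feature types."""
    pooled = []
    for records in main_clusters.values():
        for rec in records:
            pooled.append({k: v for k, v in rec.items() if k in scene_types})
    return pooled


# Every record carries feature types 1 to 4; the scenario needs only types 1 and 2.
record = {"type_1": 0.3, "type_2": 1.2, "type_3": 7.0, "type_4": 0.0}
main_clusters = {
    "main_1": [dict(record), dict(record)],
    "main_2": [dict(record)],
    "main_3": [dict(record)],
}
pooled = project_to_scene(main_clusters, {"type_1", "type_2"})
print(pooled)
```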
In this embodiment, the element clusters are generated by computing over and classifying the data in a meta-cluster through a consensus function; one meta-cluster contains a plurality of different element clusters.
In this embodiment, the dimension reduction processing screens the labels against the feature types contained in the element clusters of each meta-cluster: labels whose feature types exist in the first multi-dimensional label but are not contained in any element cluster of the meta-cluster are removed, which yields the second multi-dimensional label.
In this embodiment, the similarity is used to allocate each sample to be allocated, so as to ensure that the sample and the meta-cluster have the highest coincidence degree.
In this embodiment, the final cluster refers to a meta cluster obtained after each sample to be allocated is allocated and the samples contained therein.
In this embodiment, the final cluster feature information is feature information obtained by sorting feature information of all samples in the final cluster.
The beneficial effects of the technical scheme are as follows: the first multi-dimensional label corresponding to each main cluster is obtained by carrying out unsupervised feature selection on each main cluster, then the main cluster is selected by a cluster scene to carry out re-clustering to obtain a meta-cluster, the second multi-dimensional label is obtained by carrying out dimension reduction treatment, samples are distributed into proper meta-clusters through similarity to obtain final clusters, and then label correction is carried out, so that each sample can be accurately positioned in the most proper final cluster, and the labels of the corresponding meta-clusters can accurately reflect the features of all samples in all final clusters.
The embodiment of the invention provides a label correction method for clusters, wherein the process of acquiring the number of main clusters in the initial state and performing unsupervised feature selection on each main cluster comprises:
acquiring a first data set in an initial state, and classifying each data in the first data set based on a preset group type;
and taking the number of the classification results as the number of the main clusters, and determining the multidimensional characteristic of the corresponding main clusters by combining the data characteristic of each data in each classification result.
In this embodiment, the first data set contains all data information in the initial state.
In this embodiment, the preset group types are the main group types set in advance and are used for a simple classification of the data. There may be more preset group types than data types in the data set; for example, the data set may contain picture information and audio information while the preset group types are picture, audio and video.
In this embodiment, the multi-dimensional feature is the feature that covers the dimension features of all data contained in the main cluster. For example, main cluster 1 contains data 1 and data 2; the dimension features of data 1 are feature 1 of dimension 1 and feature 2 of dimension 2, and those of data 2 are feature 2 of dimension 2 and feature 3 of dimension 3; the dimension feature of main cluster 1 is then three-dimensional, comprising feature 1 of dimension 1, feature 2 of dimension 2 and feature 3 of dimension 3.
The beneficial effects of the technical scheme are as follows: the data set in the initial state is acquired and all data are classified by the preset group types to obtain the main clusters, and the multi-dimensional feature of each main cluster is determined from all data features in the classification result. This ensures that the main clusters contain all the data with no surplus main clusters, ensures the accuracy of the main cluster features, and facilitates the subsequent acquisition of the first multi-dimensional label.
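The classification by preset group type and the union of dimension features can be sketched as below. The item layout (a `group` field and a `features` mapping from dimension to feature) and the function names are hypothetical; the sketch only illustrates that the number of main clusters equals the number of non-empty classification results and that a main cluster's multi-dimensional feature is the union over its data.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_main_clusters(items: List[dict], preset_groups: Set[str]) -> Dict[str, List[dict]]:
    """Group items by preset group type; groups with no data produce no main cluster."""
    clusters = defaultdict(list)
    for item in items:
        if item["group"] in preset_groups:
            clusters[item["group"]].append(item)
    return dict(clusters)


def multidimensional_feature(cluster_items: List[dict]) -> Dict[int, str]:
    """Union of the dimension features of all items (later items win on conflicts)."""
    dims: Dict[int, str] = {}
    for item in cluster_items:
        dims.update(item["features"])
    return dims


items = [
    {"group": "picture", "features": {1: "feature 1", 2: "feature 2"}},  # data 1
    {"group": "picture", "features": {2: "feature 2", 3: "feature 3"}},  # data 2
    {"group": "audio", "features": {1: "feature 4"}},
]
clusters = build_main_clusters(items, {"picture", "audio", "video"})
picture_feature = multidimensional_feature(clusters["picture"])
print(len(clusters), picture_feature)
```

Although three group types are preset, only two main clusters result because no video data are present.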
The embodiment of the invention provides a label correction method for clusters, wherein performing unsupervised feature selection on each main cluster and obtaining a first multi-dimensional label for each main cluster based on the selection result comprises:
constructing a feature selection model based on an unsupervised feature technology, and inputting the multidimensional feature and corresponding data of each main cluster into the feature selection model to obtain the information quantity of each feature;
ranking the features in the same main cluster by information quantity, and selecting the top N1 features in the ranking as the selection result;
and obtaining the feature types of all the features in the selection result of each main cluster, and determining the first multi-dimensional label of the corresponding main cluster based on the feature types.
In this embodiment, the feature selection model is obtained by combining the four unsupervised feature selection techniques; all data in the first data set can be passed through the feature selection model to select the features with the highest information content for subsequent analysis.
In this embodiment, the information quantity of a feature refers to how much data the feature contains; for example, if feature 1 contains 10 data items, the information quantity of feature 1 is 10.
The beneficial effects of the technical scheme are as follows: a feature selection model is constructed with unsupervised feature techniques to obtain the information quantity of each feature and to perform feature selection, so that features that do not contribute significantly to subsequent clustering are removed and the amount of computation is reduced; the first multi-dimensional label of each main cluster is determined from the feature types, so that it accurately reflects the features of that main cluster.
The embodiment of the invention provides a label correction method for clusters, wherein selecting, based on the feature information of the cluster scene, the main clusters that need to be re-clustered to create a meta-cluster and acquiring the element clusters in each meta-cluster comprises:
acquiring the cluster scene to be applied and the feature information of that cluster scene; among all main clusters, taking those whose feature information has a similarity to the feature information of the cluster scene exceeding a preset similarity as the main clusters that need to be re-clustered; and re-clustering them to obtain a meta-cluster;
and performing ensemble clustering on the element clusters, and obtaining the number of element clusters and the feature information of each element cluster based on the clustering result.
In this embodiment, when the number difference is positive and greater than 1, dimension reduction is performed on the first multi-dimensional label: the labels of the features that the main cluster has in excess of the meta-cluster are removed from the first multi-dimensional label of the main cluster, and the remaining labels form the first dimension-reduction label.
In this embodiment, repeated screening is needed because one meta-cluster may correspond to several different main clusters. For example, meta-cluster 1 corresponds to main cluster 1 and main cluster 2; main cluster 1 has feature 1 and feature 2 in excess of meta-cluster 1, and its first dimension-reduction label includes label 1 corresponding to feature 1 and label 2 corresponding to feature 2; main cluster 2 has feature 2 in excess of meta-cluster 1, and its first dimension-reduction label includes label 2 corresponding to feature 2. The first dimension-reduction labels of main cluster 1 and main cluster 2 therefore need repeated screening, and one copy of label 2 is removed.
In this embodiment, the second multi-dimensional label is obtained by performing dimension reduction on the first multi-dimensional label. For example, a meta-cluster corresponds to main cluster 1 and main cluster 2; the label of main cluster 1 comprises label 1, label 2, label 3 and label 4, and the label of main cluster 2 comprises label 2 and label 4, where labels 1, 2, 3 and 4 correspond to features 1, 2, 3 and 4 respectively. The meta-cluster contains only feature 3 and feature 4, so label 1 and label 2 need to be removed, and the second multi-dimensional label of the meta-cluster comprises label 3 and label 4.
The beneficial effects of the technical scheme are as follows: whether a main cluster needs dimension reduction is determined from the number difference between feature types, which saves work; repeated screening of the first dimension-reduction labels ensures that no label is deleted twice, and ensures that the second multi-dimensional label reflects the features of all samples in the meta-cluster.
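The label example above can be traced with a small sketch. It follows one reading of the text, stated here as an assumption: the first dimension-reduction label of a main cluster keeps only the labels whose features the meta-cluster contains, and repeated screening merges these per-cluster labels without duplicates. The function name `second_label` is hypothetical.

```python
from typing import Dict, List, Set


def second_label(main_labels: Dict[str, List[int]], meta_features: Set[int]) -> List[int]:
    """Reduce each main cluster's label to the meta-cluster's features, then de-duplicate."""
    merged: List[int] = []
    for labels in main_labels.values():
        # First dimension-reduction label: drop labels of features absent from the meta-cluster.
        reduced = [lab for lab in labels if lab in meta_features]
        # Repeated screening: a label shared by several main clusters is kept once.
        for lab in reduced:
            if lab not in merged:
                merged.append(lab)
    return merged


main_labels = {
    "main_1": [1, 2, 3, 4],  # labels 1 to 4 correspond to features 1 to 4
    "main_2": [2, 4],
}
meta_features = {3, 4}       # the meta-cluster contains only features 3 and 4

result_label = second_label(main_labels, meta_features)
print(result_label)
```

With the data of the example, labels 1 and 2 are removed and the duplicate label 4 is kept once, leaving label 3 and label 4.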
The embodiment of the invention provides a label correction method for clusters, wherein obtaining the sample feature information of the samples to be allocated and the similarity between the sample feature information of each sample to be allocated and each element cluster, and allocating each sample to the element cluster with the highest similarity to obtain final clusters, comprises:
acquiring a second data set of a main cluster needing to be re-clustered, determining samples to be allocated based on the second data set, and extracting features of the corresponding samples to be allocated based on current cluster information of each sample to be allocated to obtain sample feature information of each sample to be allocated;
constructing sample feature vectors based on sample feature information of the samples to be distributed, calculating sample vector similarity between every two samples, and classifying the samples to be distributed based on the vector similarity between all the samples to obtain samples to be distributed in the same class;
[formula]; wherein the quantities appearing in the formula are: the vector similarity between the i-th sample and the j-th sample; the parameter-description intersection ratio between the i-th sample feature vector and the j-th sample feature vector; the parameter-description-and-parameter-value intersection ratio between the i-th sample feature vector and the j-th sample feature vector; the element set of the vector corresponding to the i-th sample and the element set of the vector corresponding to the j-th sample, each element set comprising parameter descriptions and the parameter value of each parameter description; the number of intersection elements of the i-th sample and the j-th sample based on the parameter description; the number of union elements of the i-th sample and the j-th sample based on the parameter description; the number of intersection elements of the i-th sample and the j-th sample based on the parameter description and the parameter value; the number of union elements of the i-th sample and the j-th sample based on the parameter description and the parameter value; the element difference variance between the vectors of the i-th sample and the j-th sample; the average of the element difference variances between the vectors of every pair of samples to be allocated; the distance value between the vectors of the i-th sample and the j-th sample; a similarity adjustment factor based on the intersection result; max, denoting the maximum-value operator; the number of elements of the i-th sample; and the number of elements of the j-th sample;
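The two intersection ratios defined above can be illustrated by the following minimal sketch, which computes only those two ratios and does not reproduce the full similarity formula. It assumes, for illustration, that each sample is represented as a dictionary mapping a parameter description to its parameter value, that the values are hashable, and that each ratio is the number of intersection elements divided by the number of union elements; the function name `intersection_ratios` is hypothetical.

```python
def intersection_ratios(sample_i, sample_j):
    """Return (description ratio, description-and-value ratio) for two samples.

    Each sample is assumed to be a dict {parameter description: parameter value}.
    """
    desc_i, desc_j = set(sample_i), set(sample_j)
    pairs_i, pairs_j = set(sample_i.items()), set(sample_j.items())

    desc_union = desc_i | desc_j
    pair_union = pairs_i | pairs_j
    if not desc_union:
        # Two empty samples share nothing; treating the ratio as 0.0 is an assumption.
        return 0.0, 0.0

    # Ratio based on parameter descriptions only.
    desc_ratio = len(desc_i & desc_j) / len(desc_union)
    # Ratio based on (parameter description, parameter value) pairs.
    pair_ratio = len(pairs_i & pairs_j) / len(pair_union)
    return desc_ratio, pair_ratio


sample_i = {"format": "png", "width": 640, "channels": 3}
sample_j = {"format": "png", "width": 800, "depth": 8}
ratios = intersection_ratios(sample_i, sample_j)
print(ratios)  # (0.5, 0.2)
```

Here two of the four distinct descriptions are shared, while only one of the five distinct description-value pairs is shared, so the second ratio is the stricter of the two.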
obtaining the center vector of the samples to be allocated in each category, constructing a meta-cluster feature vector of each meta-cluster based on the feature information of each meta-cluster, and calculating the similarity between the center vector of each category and the meta-cluster feature vector of each meta-cluster to obtain the vector similarity;
screening out the categories of samples to be allocated whose vector similarity exceeds a preset similarity threshold, together with the corresponding meta-clusters, and, if the samples to be allocated in a category correspond to one meta-cluster, allocating those samples to that meta-cluster to obtain a final cluster;
if the samples to be allocated in a category correspond to two or more meta-clusters, selecting the meta-cluster with the highest vector similarity and allocating the samples to be allocated in that category to it to obtain a final cluster;
if the similarity between the center vector of the samples to be allocated in a category and the meta-cluster feature vector of every meta-cluster is lower than the preset similarity threshold, calculating the similarity between the sample feature vector of each sample to be allocated in that category and the meta-cluster feature vector of each meta-cluster, and allocating each sample to the meta-cluster with the highest similarity based on the calculation result to obtain a final cluster.
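The allocation steps above can be sketched as follows. This is an illustration under stated assumptions rather than the method itself: cosine similarity stands in for the similarity measure, which the text does not fix at this step; vectors are plain numeric lists; and the names `cosine`, `center` and `allocate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here only as a stand-in similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def center(vectors):
    """Center vector: element-wise mean of the sample feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def allocate(categories, meta_clusters, threshold):
    """categories: {category: {sample id: vector}}; meta_clusters: {name: vector}."""
    assignment = {}
    for samples in categories.values():
        c = center(list(samples.values()))
        scores = {name: cosine(c, vec) for name, vec in meta_clusters.items()}
        best = max(scores, key=scores.get)
        if scores[best] > threshold:
            # One or more meta-clusters exceed the threshold: the whole
            # category goes to the one with the highest vector similarity.
            for sample_id in samples:
                assignment[sample_id] = best
        else:
            # No meta-cluster exceeds the threshold: fall back to
            # allocating each sample individually.
            for sample_id, vec in samples.items():
                assignment[sample_id] = max(
                    meta_clusters, key=lambda name: cosine(vec, meta_clusters[name])
                )
    return assignment


meta_clusters = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
categories = {
    "c1": {"s1": [0.9, 0.1], "s2": [1.0, 0.2]},
    "c2": {"s3": [1.0, 0.1], "s4": [0.1, 1.0]},
}
assignment = allocate(categories, meta_clusters, threshold=0.9)
print(assignment)  # {'s1': 'A', 's2': 'A', 's3': 'A', 's4': 'B'}
```

Category c1 is allocated as a whole because its center vector is close to meta-cluster A, whereas the center of c2 is not close enough to either meta-cluster, so its two samples are allocated one by one.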
In this embodiment, the second data set refers to the set of all data contained in all main clusters that need to be re-clustered, and all samples contained in the second data set are samples to be allocated.
In this embodiment, the cluster information refers to the information of the meta-cluster in which a sample to be allocated is located before it is allocated.
In this embodiment, feature extraction is the process of analyzing the cluster information of the meta-cluster in which a sample to be allocated is located before allocation to obtain all the feature information contained in that cluster information, and then regrouping the feature information by sample to obtain the sample feature information.
In this embodiment, the sample feature vector is composed of the feature types reflected by the sample feature information and the specific data of the sample for each of those feature types.
In this embodiment, the sample vector similarity refers to the similarity between the sample feature vectors of two different samples; category division is the process of dividing the samples into categories according to the interval in which their vector similarity falls, and the samples obtained after category division are the samples to be allocated in the same category.
In this embodiment, the center vector of the samples to be allocated in the same category may be obtained by summing the sample feature vectors of all samples to be allocated in that category and then taking the average.
In this embodiment, the meta-cluster feature vector is composed of the standard feature types of the corresponding meta-cluster, as reflected by all the standard feature information in the meta-cluster, and the standard data of those standard feature types; the vector similarity refers to the similarity between the center vector of the samples to be allocated in each category and the meta-cluster feature vector of each meta-cluster.
In this embodiment, the final cluster is a cluster obtained after all samples to be allocated are finally allocated according to the vector similarity.
The beneficial effects of the above technical solution are as follows: the samples to be allocated are determined from the data of the main clusters that need to be re-clustered, and their feature information is obtained to construct sample feature vectors; samples to be allocated in the same category are allocated together according to the similarity between feature vectors, which improves allocation efficiency and saves allocation time; and samples that cannot be allocated together are allocated individually according to the vector similarity between each sample and the meta-clusters, so that every sample is allocated to the meta-cluster with the highest similarity to obtain a final cluster.
An embodiment of the invention provides a label correction method for clusters, in which acquiring the final cluster feature information of all final clusters allocated to each meta-cluster comprises the following steps:
obtaining the sample feature information of the samples in each final cluster, and determining the final cluster feature information corresponding to each final cluster;
acquiring all final clusters in each meta-cluster, and collating the final cluster feature information of those final clusters to obtain the final cluster feature information corresponding to all final clusters in each meta-cluster.
In this embodiment, the final cluster feature information is collated from sample feature information of each sample in the final cluster.
In this embodiment, the feature information of each final cluster in the same meta-cluster is integrated to obtain the final cluster feature information corresponding to all final clusters in each meta-cluster.
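The collation described in this embodiment can be sketched as follows. The representation is an assumption made for illustration: each sample is a dictionary of feature type to value, the feature information of a final cluster is the set of values observed per feature type, and the function names are hypothetical.

```python
from collections import defaultdict


def final_cluster_features(samples):
    """Collate sample feature information into one final cluster's feature information."""
    info = defaultdict(set)
    for sample in samples:
        for feature_type, value in sample.items():
            info[feature_type].add(value)
    return dict(info)


def meta_cluster_features(final_clusters):
    """Integrate the feature information of every final cluster in one meta-cluster."""
    return {name: final_cluster_features(samples) for name, samples in final_clusters.items()}


final_clusters = {
    "final 1": [{"format": "png", "width": 640}, {"format": "jpg"}],
    "final 2": [{"codec": "aac"}],
}
collated = meta_cluster_features(final_clusters)
print(collated)
```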
The beneficial effects of the above technical solution are as follows: the final cluster feature information of each final cluster is obtained from the sample feature information of each sample, which ensures that the feature information of each final cluster is comprehensive; and the final cluster feature information corresponding to all final clusters in each meta-cluster is obtained by integrating the final cluster feature information of all final clusters contained in that meta-cluster, for use in subsequent label correction.
An embodiment of the invention provides a label correction method for clusters, in which correcting the corresponding second multi-dimensional label based on the final cluster feature information comprises the following steps:
acquiring the second multi-dimensional label of each meta-cluster, and determining the label features of each meta-cluster based on the second multi-dimensional label;
determining the correspondence between the label features of each meta-cluster and the final cluster features reflected in the final cluster feature information of that meta-cluster, and correcting the second multi-dimensional label based on the correspondence.
The beneficial effects of the above technical solution are as follows: all final cluster features reflected in the final cluster feature information corresponding to all final clusters in each meta-cluster are obtained and compared with the features reflected in the second multi-dimensional label of the corresponding meta-cluster, and the label is then corrected according to the comparison result, so that the corrected label completely and accurately reflects all the features of the corresponding meta-cluster.
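The comparison and correction described above can be sketched as follows. The text does not spell out the correction rule, so the rule used here is an assumption: label features confirmed by the final clusters are kept, final cluster features missing from the label are added, and label features that no final cluster exhibits are removed. The function name `correct_label` is hypothetical.

```python
def correct_label(second_label, final_features):
    """Compare label features with final cluster features and return the corrected label."""
    label_features = set(second_label)
    observed = set(final_features)

    matched = label_features & observed   # label features confirmed by the final clusters
    added = observed - label_features     # final cluster features missing from the label
    removed = label_features - observed   # label features no final cluster exhibits

    corrected = sorted(matched | added)
    correspondence = {
        "matched": sorted(matched),
        "added": sorted(added),
        "removed": sorted(removed),
    }
    return corrected, correspondence


corrected, correspondence = correct_label(
    ["feature 3", "feature 4"], {"feature 4", "feature 5"}
)
print(corrected)       # ['feature 4', 'feature 5']
print(correspondence)  # {'matched': ['feature 4'], 'added': ['feature 5'], 'removed': ['feature 3']}
```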
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (5)

1. A label correction method for clusters, comprising:
step 1: acquiring the number of main clusters in an initial state, performing unsupervised feature selection on each main cluster, and acquiring a first multi-dimensional label corresponding to each main cluster based on a selection result;
step 2: selecting a main cluster needing to be re-clustered based on the characteristic information of the cluster scene to create a meta-cluster, acquiring the meta-cluster in each meta-cluster, and performing dimension reduction processing on the first multi-dimension label based on the characteristic information corresponding to the meta-cluster in each meta-cluster to acquire a second multi-dimension label corresponding to each meta-cluster;
step 3: obtaining sample feature information of the samples to be allocated and the similarity between the sample feature information of each sample to be allocated and each meta-cluster, and allocating each sample to be allocated to the meta-cluster with the highest similarity to obtain a final cluster;
step 4: acquiring the final cluster feature information of all final clusters allocated to each meta-cluster, and correcting the corresponding second multi-dimensional label based on the final cluster feature information;
in step 4, correcting the corresponding second multi-dimensional label based on the final cluster feature information includes:
acquiring the second multi-dimensional label of each meta-cluster, and determining the label features of each meta-cluster based on the second multi-dimensional label;
determining the correspondence between the label features of each meta-cluster and the final cluster features reflected in the final cluster feature information of that meta-cluster, and correcting the second multi-dimensional label based on the correspondence;
in step 3, obtaining the sample feature information of the samples to be allocated and the similarity between the sample feature information of each sample to be allocated and each meta-cluster, and allocating each sample to be allocated to the meta-cluster with the highest similarity to obtain a final cluster, comprises the following steps:
acquiring a second data set of the main clusters that need to be re-clustered, determining the samples to be allocated based on the second data set, and extracting features of each sample to be allocated based on its current cluster information to obtain the sample feature information of each sample to be allocated;
constructing sample feature vectors based on the sample feature information of the samples to be allocated, calculating the sample vector similarity between every two samples, and classifying the samples to be allocated based on the vector similarity between all samples to obtain the samples to be allocated in the same category;
[formula]; wherein the quantities appearing in the formula are: the vector similarity between the i-th sample and the j-th sample; the parameter-description intersection ratio between the i-th sample feature vector and the j-th sample feature vector; the parameter-description-and-parameter-value intersection ratio between the i-th sample feature vector and the j-th sample feature vector; the element set of the vector corresponding to the i-th sample and the element set of the vector corresponding to the j-th sample, each element set comprising parameter descriptions and the parameter value of each parameter description; the number of intersection elements of the i-th sample and the j-th sample based on the parameter description; the number of union elements of the i-th sample and the j-th sample based on the parameter description; the number of intersection elements of the i-th sample and the j-th sample based on the parameter description and the parameter value; the number of union elements of the i-th sample and the j-th sample based on the parameter description and the parameter value; the element difference variance between the vectors of the i-th sample and the j-th sample; the average of the element difference variances between the vectors of every pair of samples to be allocated; the distance value between the vectors of the i-th sample and the j-th sample; a similarity adjustment factor based on the intersection result; max, denoting the maximum-value operator; the number of elements of the i-th sample; and the number of elements of the j-th sample;
obtaining the center vector of the samples to be allocated in each category, constructing a meta-cluster feature vector of each meta-cluster based on the feature information of each meta-cluster, and calculating the similarity between the center vector of each category and the meta-cluster feature vector of each meta-cluster to obtain the vector similarity;
screening out the categories of samples to be allocated whose vector similarity exceeds a preset similarity threshold, together with the corresponding meta-clusters, and, if the samples to be allocated in a category correspond to one meta-cluster, allocating those samples to that meta-cluster to obtain a final cluster;
if the samples to be allocated in a category correspond to two or more meta-clusters, selecting the meta-cluster with the highest vector similarity and allocating the samples to be allocated in that category to it to obtain a final cluster;
if the similarity between the center vector of the samples to be allocated in a category and the meta-cluster feature vector of every meta-cluster is lower than the preset similarity threshold, calculating the similarity between the sample feature vector of each sample to be allocated in that category and the meta-cluster feature vector of each meta-cluster, and allocating each sample to the meta-cluster with the highest similarity based on the calculation result to obtain a final cluster;
in step 1, the process of obtaining the number of the main clusters in the initial state and performing unsupervised feature selection on each main cluster includes:
acquiring a first data set in an initial state, and classifying each data in the first data set based on a preset group type;
the number of the classification results is used as the number of the main clusters, and the multi-dimensional characteristics of the corresponding main clusters are determined by combining the data characteristics of each data in each classification result;
wherein the first dataset comprises picture information and audio information;
the preset group type is a picture type, an audio type and a video type.
2. The method for correcting labels of clusters according to claim 1, wherein in step 1, unsupervised feature selection is performed on each main cluster, and a first multi-dimensional label corresponding to each main cluster is obtained based on a selection result, which includes:
constructing a feature selection model based on an unsupervised feature technology, and inputting the multidimensional feature and corresponding data of each main cluster into the feature selection model to obtain the information quantity of each feature;
ranking the information quantities of the features in the same main cluster, and selecting the features with the top N1 information quantities as the selection result;
and obtaining the feature types of all the features in the selection result of each main cluster, and determining the first multi-dimensional label of the corresponding main cluster based on the feature types.
3. The method for correcting labels of clusters according to claim 1, wherein in step 2, selecting a primary cluster to be re-clustered to create a meta-cluster based on feature information of a cluster scene, obtaining meta-clusters in each meta-cluster includes:
acquiring the applied cluster scene and the feature information of that cluster scene, taking those main clusters whose feature information has a similarity to the feature information of the cluster scene exceeding a preset similarity as the main clusters that need to be re-clustered, and re-clustering them to obtain meta-clusters;
and carrying out integrated clustering on the element clusters, and obtaining the number of the element clusters and the characteristic information of the corresponding element clusters based on a clustering result.
4. The method for correcting labels of clusters according to claim 1, wherein in step 2, the dimension reduction processing is performed on the first multi-dimension label based on the feature information corresponding to the meta-clusters in each meta-cluster to obtain a second multi-dimension label corresponding to each meta-cluster, including:
acquiring the main clusters associated with each meta-cluster and the number difference between the feature types contained in each main cluster and the feature types corresponding to the meta-clusters in the corresponding meta-cluster, and performing dimension reduction processing on the corresponding main clusters based on the number difference to obtain the first dimension reduction labels of the corresponding main clusters;
and obtaining the repeatability of the feature types contained in all the main clusters associated with each meta-cluster, carrying out repeated screening processing on the first dimension reduction labels based on the repeatability, and obtaining the second dimension reduction labels of each meta-cluster based on the screening result.
5. The method for correcting labels of clusters according to claim 1, wherein in step 4, acquiring the final cluster feature information of all final clusters allocated to each meta-cluster includes:
sample characteristic information of a corresponding sample in each final cluster is obtained, and the final cluster characteristic information corresponding to each final cluster is determined;
and acquiring all final clusters in each meta-cluster, and performing information arrangement based on the final cluster characteristic information of the corresponding final clusters to obtain the final cluster characteristic information corresponding to all final clusters in each meta-cluster.
CN202311630041.4A 2023-12-01 2023-12-01 Label correction method for clusters Active CN117332303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311630041.4A CN117332303B (en) 2023-12-01 2023-12-01 Label correction method for clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311630041.4A CN117332303B (en) 2023-12-01 2023-12-01 Label correction method for clusters

Publications (2)

Publication Number Publication Date
CN117332303A CN117332303A (en) 2024-01-02
CN117332303B true CN117332303B (en) 2024-03-26

Family

ID=89279774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311630041.4A Active CN117332303B (en) 2023-12-01 2023-12-01 Label correction method for clusters

Country Status (1)

Country Link
CN (1) CN117332303B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101880628B1 (en) * 2017-11-27 2018-08-16 한국인터넷진흥원 Method for labeling machine-learning dataset and apparatus thereof
CN110046586A (en) * 2019-04-19 2019-07-23 腾讯科技(深圳)有限公司 A kind of data processing method, equipment and storage medium
CN110457155A (en) * 2019-07-31 2019-11-15 清华大学 A kind of modification method, device and the electronic equipment of sample class label
CN115439887A (en) * 2022-08-26 2022-12-06 三维通信股份有限公司 Pedestrian re-identification method and system based on pseudo label optimization and storage medium
CN115687621A (en) * 2022-11-07 2023-02-03 中国农业银行股份有限公司 Short text label labeling method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446399A (en) * 2019-09-02 2021-03-05 华为技术有限公司 Label determination method, device and system

Also Published As

Publication number Publication date
CN117332303A (en) 2024-01-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant